SEQanswers > Applications Forums > RNA Sequencing
Old 04-24-2009, 06:19 AM   #21
jmaury
Junior Member
 
Location: Paris

Join Date: Mar 2008
Posts: 4
Default

Hi Kevin,

Quote:
Originally Posted by klh View Post
Hi Jean-marc,
I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1476 25mers, 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.
Be careful: the number of reads which cross splice boundaries is 9 * 24 * 2 (*2 because reads before AND after the splice site cross the junction), so that gives ~29% of reads crossing splice junctions.

But according to the Mortazavi et al. paper, they mapped the RNA-Seq reads against the mouse transcriptome, so reads that come from an unknown splice junction won't fall into the category of reads that cross splice boundaries... However, the number of unknown splice junctions in the mouse genome may be low!

JM
jmaury is offline   Reply With Quote
Old 04-24-2009, 10:12 AM   #22
alim
Member
 
Location: pasadena, ca

Join Date: Jun 2008
Posts: 14
Default fraction of splice-crossing reads is a function of read length

Hi,

It's nice to see a spirited discussion of what the fraction of reads crossing splice junctions should be (and to be quoted in that debate). I'll just restate what is, in retrospect, obvious, based on looking at a couple of years' worth of data.

The fraction of reads crossing known or novel junctions is a function of both the read length and the distribution of exon lengths. For mammalian genomes like human and mouse (I claim no knowledge of the grapevine genome), less than 3% of 25mers end up crossing a known splice junction; in the same samples and genomes, this number goes up to 20% by the time you use 75mers. This number does not change dramatically in a well-annotated genome when you include novel splices!
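For intuition, here is a toy calculation of that read-length dependence (a hypothetical sketch, not the analysis from the paper), using the idealized transcript discussed earlier in the thread: 10 exons of 150nt each. Real exon-length distributions are much broader, which is why the empirical fractions (3% and 20%) are lower than this toy model's:

```python
def junction_crossing_fraction(exon_lengths, k):
    """Fraction of k-mer start positions in a spliced transcript whose
    k-mer spans at least one exon-exon junction."""
    n_kmers = sum(exon_lengths) - k + 1
    # boundary b (0-based): junction between transcript positions b-1 and b
    boundaries, pos = [], 0
    for length in exon_lengths[:-1]:
        pos += length
        boundaries.append(pos)
    crossing = set()
    for b in boundaries:
        # a k-mer starting at s covers s .. s+k-1; it spans boundary b
        # iff s <= b-1 and s+k-1 >= b
        for s in range(max(0, b - k + 1), min(b, n_kmers)):
            crossing.add(s)
    return len(crossing) / n_kmers

for k in (25, 75):
    print(k, round(junction_crossing_fraction([150] * 10, k), 3))
# 25 -> 0.146, 75 -> 0.467
```

With uniform start positions, each junction is crossed by k-1 possible k-mer starts, so the crossing fraction grows roughly linearly with read length until reads start spanning multiple junctions.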

It's worth pointing out that since RNA-seq is a numbers game, many people, such as our lab, used raw reads rather than the Solexa quality-filtered reads (and we can discuss whether Illumina's pre-1.0 quality scores or filters made any sense at all); this strategy is sometimes referred to as "quality-filtering by mapping".

One particularly interesting aspect of Illumina's reads is that a fraction of the reads accumulate most of their errors at the 3' end [in fact we took 32mers and truncated them to 25mers just to avoid the errors accumulating in bases 26-32]. Given the error rates with the old reagents, it's not surprising at all that 40% of the reads would have more than 2 errors in 25 bp.
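Assuming (simplistically) independent errors at a uniform per-base rate, the binomial tail shows how quickly that happens:

```python
from math import comb

def frac_more_than(read_len, p, max_mm=2):
    """P(more than max_mm errors) in a read, assuming independent
    errors at a uniform per-base rate p."""
    return 1 - sum(comb(read_len, k) * p**k * (1 - p)**(read_len - k)
                   for k in range(max_mm + 1))

for p in (0.04, 0.06, 0.08, 0.10):
    print(f"{p:.0%} per-base error: "
          f"{frac_more_than(25, p):.0%} of 25mers have >2 errors")
```

Per-base error rates in the high single digits (concentrated at the read ends) are enough to push roughly 40% of 25mers past a 2-mismatch limit.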

Nowadays, 2x75 reads on a good run with the new reagents are of far better quality than those 25mers.

Ali
alim is offline   Reply With Quote
Old 04-24-2009, 11:03 AM   #23
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 124
Default Confused

Quote:
Originally Posted by alim View Post
Hi,

The fraction of reads crossing known or novel junctions is a function of both the read length and the distribution of exon lengths. For mammalian genomes like human and mouse (I claim no knowledge of the grapevine genome), less than 3% of 25mers end up crossing a known splice junction; in the same samples and genomes, this number goes up to 20% by the time you use 75mers. This number does not change dramatically in a well-annotated genome when you include novel splices!

It's worth pointing out that since RNA-seq is a numbers game, many people, such as our lab, used raw reads rather than the Solexa quality-filtered reads (and we can discuss whether Illumina's pre-1.0 quality scores or filters made any sense at all); this strategy is sometimes referred to as "quality-filtering by mapping".

One particularly interesting aspect of Illumina's reads is that a fraction of the reads accumulate most of their errors at the 3' end [in fact we took 32mers and truncated them to 25mers just to avoid the errors accumulating in bases 26-32]. Given the error rates with the old reagents, it's not surprising at all that 40% of the reads would have more than 2 errors in 25 bp.
Ali

Great to have you here! Thanks for the explanation on filtering. That really clears things up a lot.

Part of the paper can be confusing. The RNA-seq was carried out on 3 types of tissues (brain, liver, muscle), yet the Methods state that total RNA was pooled from tissues. That statement is misleading.

It'd be great if you could clear these up for me:
-The total data in your supplementary table 1 doesn't add up to 140 million reads
-What I understand from the paper is that all the "unmappable" reads that map to neither the genome nor the splice junctions potentially come from novel transcripts, is that right?
(Just to clarify from the author, although others have stated that in a previous post)
-Regarding the percentage of reads at splice junctions... don't you think that splice-spanning reads should be divided by (total reads that map to the genome) instead of the entire dataset?

By now, everyone knows that 25bp reads are too short to map everything uniquely. With reads only 25bp long, I would imagine all the reads will align somewhere in the genome.

Thanks in advance,
Melissa

Oops... I realized I made a big mistake when I said "Maybe some reads that are overlapping/spanning exon splice junctions are lost after mapping." I forgot that RNA-seq data is not supposed to contain introns (unless intron retention occurs... which is less likely to happen than in plants).
Melissa is offline   Reply With Quote
Old 04-24-2009, 06:32 PM   #24
alim
Member
 
Location: pasadena, ca

Join Date: Jun 2008
Posts: 14
Default

Hi Melissa,

Let me see if I can answer your questions.

The reference to "pooled tissues" means that the RNA samples (coming from a third party) were not from a single mouse brain but from many mouse brains pooled (same goes for the liver and muscle samples), not that we mixed brain+liver+muscle in the analysis or in any of the sequencing. Note that these RNA samples are from the inbred mouse strain that was sequenced, so we do not expect many transcripts to be affected by variation/SNPs/etc.

- The total raw reads, including the unmapped ones, add up to 216M, but since only 65% of the reads mapped on average, we used 140M reads in the analysis.

- Remember that we map the reads to the mouse reference genome in addition to known splice junctions. Assuming that novel transcripts are like existing transcripts, then most of their reads should map to the mouse reference genome. While there are undoubtedly thousands of junctions that we are missing, these are unlikely to be from highly expressed transcripts (say the top 10 expressed genes in muscle or liver), which are the best characterized transcripts & contribute disproportionately to the read counts.

- A better explanation for the 40% that aren't mapping is that they likely have the same distribution of unique/splice/multi reads as the other 60%, but with more than 2 mismatches... most of them would map to the same places as the other reads, which means primarily to known transcripts!

- the numbers for supplemental table 1 are just for illustration purposes. ERANGE does all of its calculations using mapped (unique + spliced + multi) reads.

- Actually, we see a surprising amount of intron retention in some genes! Clearly, there is still much work to be done in making the gene models comprehensive, even in genomes as extensively studied as human and mouse (and C. elegans & Drosophila too).

I hope this helps!

Ali
alim is offline   Reply With Quote
Old 04-26-2009, 08:20 AM   #25
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 124
Default

Thanks Ali for the detailed explanation. Longer reads will definitely reduce unmappable reads, as shown here. Longer read lengths increase unique reads in RNA-seq.
Melissa is offline   Reply With Quote
Old 04-26-2009, 11:57 AM   #26
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by Melissa View Post
Thanks Ali for the detailed explanation. Longer reads will definitely reduce unmappable reads, as shown here. Longer read lengths increase unique reads in RNA-seq.
That's not necessarily true. Consider 100bp paired end reads. You could have reads that have more than one exon junction (even three or four if the exons are small). This is a significant bioinformatic challenge, especially when we wish to find novel transcripts and new exons.

In the latter case, you cannot map to all known or predicted exons (or all possible junctions), since you might be missing new exons (especially if the sample has a high mutation rate, i.e. cancer, which could create gene fusions). So really you have to map back to the full DNA reference, tolerating that a read can span many junctions. My cursory look has not found anything that can handle millions, if not billions, of these reads (we want to sequence 1000s of human samples at good coverage) while searching for such junctions.

Finally, the mapping above has very low sensitivity. Allowing only 2 mismatches requires that the reads you map have a <2% error rate (for 100bp reads), which is unrealistic and creates the majority of "unmapped" reads (you were not sensitive enough).
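To put a number on that (a back-of-envelope sketch, assuming independent errors at a uniform per-base rate), the expected fraction of 100bp reads surviving a 2-mismatch cutoff falls off sharply with the error rate:

```python
from math import comb

def frac_within(read_len, p, max_mm=2):
    """Expected fraction of reads with at most max_mm errors,
    assuming independent errors at a uniform per-base rate p."""
    return sum(comb(read_len, k) * p**k * (1 - p)**(read_len - k)
               for k in range(max_mm + 1))

for p in (0.005, 0.01, 0.02, 0.03):
    print(f"{p:.1%} per-base error: "
          f"{frac_within(100, p):.1%} of 100bp reads pass a 2-mismatch filter")
```

At 1% error roughly 92% of reads pass, but at 2% only about 68% do, so a 2-mismatch aligner silently discards a large fraction of the data.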

On the other hand, this means that people like us have a job!
nilshomer is offline   Reply With Quote
Old 04-26-2009, 03:11 PM   #27
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Quote:
Originally Posted by nilshomer View Post
That's not necessarily true. Consider 100bp paired end reads. You could have reads that have more than one exon junction (even three or four if the exons are small). This is a significant bioinformatic challenge, especially when we wish to find novel transcripts and new exons.
This should not be a problem (for mapping) as long as your aligner is capable of finding the best match within the read. AFAIK there is only one method that claims to do this, though, but at least AB is starting to go in this direction with the RNA-seq pipeline (by identifying the best match from either end of the read). And of course you could split your 2x100bp reads into 8x25 and use ISAS to map those reads really fast.
Chipper is offline   Reply With Quote
Old 04-26-2009, 03:20 PM   #28
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by Chipper View Post
This should not be a problem (for mapping) as long as your aligner is capable of finding the best match within the read. AFAIK there is only one method that claims to do this, though, but at least AB is starting to go in this direction with the RNA-seq pipeline (by identifying the best match from either end of the read). And of course you could split your 2x100bp reads into 8x25 and use ISAS to map those reads really fast.
1. Please explain to me how the "best" match would translate directly (without further reconstruction) into the correct isoform/transcript. I would like to know how the "best match" works with a read covering 3 exons.

2. You cannot split the reads since you do not know where the splice junctions exist in the read. If you did split, you could use any aligner since they are all fast (with a few exceptions).
nilshomer is offline   Reply With Quote
Old 04-27-2009, 12:32 AM   #29
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Melissa's statement was that longer reads increase the possibility of getting a unique (exonic) sequence, and the OP's question was about the fraction of unmappable reads. So a read spanning 3 exons would then be correctly aligned to the middle exon. Of course, as you say, the problem lies in defining the isoforms. Re 2) I did not say you have to be able to align all parts of the read, but rather that it increases the likelihood of finding the alignment of a read compared to having only one 25mer from the same transcript.
Chipper is offline   Reply With Quote
Old 04-27-2009, 12:50 AM   #30
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Longer reads will definitely help in the mapping process for any seq experiment.

It may not necessarily be correct to map to the middle exon. For example, if the exons are 40bp, 20bp, and 20bp respectively, then in the majority of cases (reads spanning the exons) the first exon contributes the most sequence to the read. If you want to find the "best fit" exon, it would be the first.
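A quick sketch of that "best fit" computation (a hypothetical helper, not from any published tool): for a read laid along the spliced transcript, count how many bases fall in each exon and take the argmax:

```python
def best_fit_exon(exon_lengths, start, read_len):
    """Index of the exon contributing the most bases to a read placed at
    0-based `start` along the spliced transcript, plus all overlaps."""
    overlaps, exon_start = [], 0
    for length in exon_lengths:
        exon_end = exon_start + length  # half-open [exon_start, exon_end)
        ov = max(0, min(exon_end, start + read_len) - max(exon_start, start))
        overlaps.append(ov)
        exon_start = exon_end
    return max(range(len(overlaps)), key=overlaps.__getitem__), overlaps

# 60bp reads over 40+20+20bp exons: every spanning placement favors exon 0
for s in (0, 10, 20):
    print(s, best_fit_exon([40, 20, 20], s, 60))
```

For these exon lengths, the first exon's overlap (40 - start) is at least as large as the others' for every start position of a read spanning all three exons, which is exactly the point above.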

As long as it is understood that the ultimate goal is either to reconstruct the isoforms, of which the above is an intermediate step, or to measure gene expression, where we simply care about the abundance of transcripts from a given gene and so do not need to find the isoforms, then I would tend to agree with any intermediate approach.

It's a difficult problem, and I am still unsure of the best path of attack.
nilshomer is offline   Reply With Quote
Old 05-19-2009, 07:49 AM   #31
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Hello everyone, and thanks for adding to the discussion here. Sorry I did not have time to reply. I really appreciate everyone's input, and I even learned more than I expected.

Alim, from what I understand from your post, you think that the ~60% mapping rate of raw 25bp reads in your experiment is most likely due to the higher error rates of the old Illumina sequencing chemistry.

The newer chemistry, then, should help increase the mapping rates by reducing sequencing errors. I will take a closer look at this once the longer-read data sets are available.
NGSfan is offline   Reply With Quote
Old 08-27-2013, 06:14 AM   #32
Lizex
Member
 
Location: Cape Town South Africa

Join Date: Feb 2011
Posts: 22
Default RNA-Seq reads mapping very low

Hi All
I have been mapping reads generated on a MiSeq instrument (150bp) against the apple genome. Reads were processed (trimmed and filtered using the fastx_toolkit), and the processed reads are over 100bp long. Using Tophat (version 1.4) to align the reads, I got an average of 12% of reads mapped; e.g., out of 5 Gb of reads (read1 and read2), only 600 Mb of data ended up in the accepted_hits.bam file, which I take as the mapped reads. Is there something wrong, or am I missing something here? Any suggestions as to how to address this very low mapping rate? I'm sure the percentage of mapped reads should be much better than 12%.
Lizex is offline   Reply With Quote
Old 08-27-2013, 08:24 AM   #33
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by Lizex View Post
Hi All
I have been mapping reads generated on a MiSeq instrument (150bp) against the apple genome. Reads were processed (trimmed and filtered using the fastx_toolkit), and the processed reads are over 100bp long. Using Tophat (version 1.4) to align the reads, I got an average of 12% of reads mapped; e.g., out of 5 Gb of reads (read1 and read2), only 600 Mb of data ended up in the accepted_hits.bam file, which I take as the mapped reads. Is there something wrong, or am I missing something here? Any suggestions as to how to address this very low mapping rate? I'm sure the percentage of mapped reads should be much better than 12%.
Since you have paired-end data, be careful using fastx_toolkit. I've seen a lot of people desyncing their paired-end reads with it.
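One quick way to catch desynced files (a hypothetical sanity check, assuming mate names are identical apart from a trailing /1 or /2): walk both FASTQ files in lockstep and compare headers:

```python
import io
import itertools

def read_names(fastq_handle):
    """Yield read names (first header token, /1 or /2 suffix stripped)
    from a 4-line-per-record FASTQ stream."""
    for header in itertools.islice(fastq_handle, 0, None, 4):
        name = header.rstrip("\n").split()[0]
        yield name[:-2] if name.endswith(("/1", "/2")) else name

def pairs_in_sync(r1_handle, r2_handle):
    """True iff both files have the same read names in the same order."""
    sentinel = object()
    return all(a is not sentinel and b is not sentinel and a == b
               for a, b in itertools.zip_longest(read_names(r1_handle),
                                                 read_names(r2_handle),
                                                 fillvalue=sentinel))

# tiny in-memory example
r1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n")
r2 = io.StringIO("@r1/2\nACGT\n+\nIIII\n@r2/2\nAAAA\n+\nIIII\n")
print(pairs_in_sync(r1, r2))  # True
```

If a filtering tool drops a read from one file but not its mate, the names drift apart and this returns False, which is exactly the failure mode to watch for with fastx_toolkit.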
dpryan is offline   Reply With Quote
Old 08-27-2013, 08:33 AM   #34
lukas1848
Member
 
Location: Germany

Join Date: Jun 2011
Posts: 54
Default

Did you check for adapter contamination of your reads?
lukas1848 is offline   Reply With Quote
Old 08-27-2013, 01:22 PM   #35
Lizex
Member
 
Location: Cape Town South Africa

Join Date: Feb 2011
Posts: 22
Default Why low mapping

Thanks for the reply. I removed adapters from the reads with cutadapt. I'm also curious about how the reads map. Please see the attached picture for how they map, and note the large number of N's. Should I worry about the N's and remove them, or should I rather leave them? Any suggestions?
Attached Images
File Type: png Screen shot 2013-08-18 at 10.13.38 PM.png (129.4 KB, 7 views)
Lizex is offline   Reply With Quote
Old 08-27-2013, 01:24 PM   #36
Lizex
Member
 
Location: Cape Town South Africa

Join Date: Feb 2011
Posts: 22
Default

Quote:
Originally Posted by dpryan View Post
Since you have paired-end data, be careful using fastx_toolkit. I've seen a lot of people desyncing their paired-end reads with it.
Thanks, dpryan. What do you suggest I use?
Lizex is offline   Reply With Quote
Old 08-27-2013, 02:11 PM   #37
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by Lizex View Post
Thanks, dpryan. What do you suggest I use?
trim_galore or trimmomatic are common suggestions. I've had good luck in the past with trim_galore, which is also quite flexible.
dpryan is offline   Reply With Quote
Old 08-27-2013, 02:15 PM   #38
Lizex
Member
 
Location: Cape Town South Africa

Join Date: Feb 2011
Posts: 22
Default

Quote:
Originally Posted by dpryan View Post
trim_galore or trimmomatic are common suggestions. I've had good luck in the past with trim_galore, which is also quite flexible.
Thanks. I'll give it a try.
Lizex is offline   Reply With Quote
Old 08-29-2013, 02:43 AM   #39
Lizex
Member
 
Location: Cape Town South Africa

Join Date: Feb 2011
Posts: 22
Default

Quote:
Originally Posted by Lizex View Post
Thanks. I'll give it a try.
Hi dpryan

I've tried Trimmomatic. There are 1,492,345 reads in each of read1.fq and read2.fq. After mapping with Tophat 1.4.0, the stats of the accepted_hits.bam file look like this:

samtools flagstat /Data_Analysis/E0.2.3/E0_tophat/accepted_hits.bam
1404454 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1404454 + 0 mapped (100.00%:nan%)
1404454 + 0 paired in sequencing
682904 + 0 read1
721550 + 0 read2
1200618 + 0 properly paired (85.49%:nan%)
1243330 + 0 with itself and mate mapped
161124 + 0 singletons (11.47%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Is this a good mapping or bad? How should I interpret this result?
Lizex is offline   Reply With Quote
Old 08-29-2013, 03:09 AM   #40
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

That looks pretty reasonable. You started with ~1.5 million read pairs (~3 million reads) and aligned ~1.4 million of them, with ~85% of the mapped reads properly paired. That's certainly a vast improvement over the original 12% mapping rate that you reported!
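Spelling out the arithmetic (a back-of-envelope, counting both mates as separate reads; note that accepted_hits.bam contains only mapped records, which is why flagstat reports 100% mapped, and multi-mapping reads can appear more than once, so this is an upper bound):

```python
input_reads = 2 * 1_492_345   # read1.fq + read2.fq after Trimmomatic
mapped = 1_404_454            # records in accepted_hits.bam
properly_paired = 1_200_618   # "properly paired" line from flagstat

print(f"overall mapping rate:     {mapped / input_reads:.1%}")
print(f"properly paired (input):  {properly_paired / input_reads:.1%}")
# roughly 47% mapped and 40% properly paired, up from the earlier 12%
```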
dpryan is offline   Reply With Quote