SEQanswers

Old 09-15-2009, 07:01 AM   #1
warrenemmett
Member
 
Location: South Africa

Join Date: Nov 2008
Posts: 23
Can paired-end mapping produce more reads than single-end?

Hi,

I am currently mapping 75bp paired-end data to a cDNA library. I have heard that there is a chance there are chimeras in a large portion of the reads, and as such I have had to change to single-end analysis.

My results seem to show that I have fewer hits in single-end alignment (I counted reads mapped in both paired-end and single-end alignment, using bowtie).

Does anyone know if this is in fact possible, or is it more likely an error in the code?
Old 09-15-2009, 11:32 PM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870

It's certainly possible. Most paired end mapping algorithms are able to use mapping information about one end to infer a likely position for the other, meaning that a read which in isolation couldn't be mapped uniquely can be positioned if the position of its other end is known.

If you see a big difference in efficiency I'd also double check that you were using the same mapping parameters in both runs just to be sure. Using 75bp reads I'd be surprised if there was a big advantage to using paired end over single end mapping for a cDNA library.
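The rescue idea described above can be sketched roughly as follows. This is only an illustration of the principle, not how bowtie or any other aligner actually implements it; the function name `rescue_with_mate`, the `max_insert` threshold and the transcript names are all made up for the example.

```python
from typing import List, Optional, Tuple

Hit = Tuple[str, int]  # (reference name, 0-based start position)


def rescue_with_mate(mate_hit: Hit,
                     candidate_hits: List[Hit],
                     max_insert: int = 500) -> Optional[Hit]:
    """Pick the one candidate hit that is consistent with the mate's position.

    mate_hit       -- the uniquely mapped end of the pair
    candidate_hits -- all equally good positions for the ambiguous end
    max_insert     -- hypothetical maximum distance allowed between the ends
    """
    ref, pos = mate_hit
    consistent = [
        (c_ref, c_pos)
        for c_ref, c_pos in candidate_hits
        if c_ref == ref and abs(c_pos - pos) <= max_insert
    ]
    # Only report a position if the pairing constraint leaves exactly one.
    return consistent[0] if len(consistent) == 1 else None


# Hypothetical example: one end maps uniquely, the other has three candidates.
mate = ("transcript_A", 1200)
candidates = [("transcript_A", 1350), ("transcript_B", 40), ("transcript_C", 9100)]
print(rescue_with_mate(mate, candidates))  # ('transcript_A', 1350)
```

On its own the second read would be reported as ambiguous; with the mate it gets a single position, which is how a paired-end run can end up with more mapped reads.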
Old 09-16-2009, 03:45 AM   #3
warrenemmett
Member
 
Location: South Africa

Join Date: Nov 2008
Posts: 23

Thanks for the reply! I had the same suspicion, and after more searching I have found the bug.

Thanks again for the help!
Old 09-16-2009, 02:50 PM   #4
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by simonandrews
It's certainly possible. Most paired end mapping algorithms are able to use mapping information about one end to infer a likely position for the other, meaning that a read which in isolation couldn't be mapped uniquely can be positioned if the position of its other end is known.

If you see a big difference in efficiency I'd also double check that you were using the same mapping parameters in both runs just to be sure. Using 75bp reads I'd be surprised if there was a big advantage to using paired end over single end mapping for a cDNA library.
I would argue that the strategy above shows a common misconception about paired end data (or mate-end). For the human genome, inferring one end from the other does not return much, due to the local nature of repeats (think of insert distributions that are >500bp wide and how local repeats would confound placing the unaligned end). I have seen no data to show what inferring one end from the other does to false mappings, especially around large-scale insertion, deletion, and translocation events. At my current state of thinking, using paired end constraints during mapping is a heuristic to make up for the fact you did not map each end sensitively enough.
Old 09-16-2009, 11:44 PM   #5
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870

Quote:
Originally Posted by nilshomer
At my current state of thinking, using paired end constraints during mapping is a heuristic to make up for the fact you did not map each end sensitively enough.
I think that's an overgeneralisation. I don't believe that paired end mapping is a panacea, but there are certainly cases where it offers benefits in sensitivity over single end mapping.

You say that using paired end is a way to make up for inadequate initial mapping, but with shorter read lengths there are plenty of reads which could map exactly, with no mismatches, at multiple locations in the genome. Even where these are within repeat regions, you can find that there are only a small number of locations where the read could map; using a paired end will then give you a mapped position where a single end would not.

Having said that, I'd actually argue that for mapping-type applications (e.g. ChIP-Seq) the benefit of paired end comes from the separation between ends. Repeated regions can stretch over many tens of bases, so increasing the length of a single-end read provides diminishing returns in terms of mapping efficiency. Using shorter paired-end reads with a greater separation between the ends will in many cases offer a greater chance of positioning a read, since one end may have escaped the repetitive region.

Having said all this, I personally would stick with single end reads for ChIP-Seq, mRNA seq and similar applications from a cost point of view. We do a lot of paired end sequencing at our site but normally it's for applications which absolutely require it, such as 4C.
Old 09-17-2009, 04:45 AM   #6
montera
Junior Member
 
Location: brasil

Join Date: Jul 2009
Posts: 5

Can anybody help me define single-end and paired-end alignments? Thanks a lot
Old 09-17-2009, 09:46 AM   #7
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by simonandrews
I think that's an overgeneralisation. I don't believe that paired end mapping is a panacea, but there are certainly cases where it offers benefits in sensitivity over single end mapping.

You say that using paired end is a way to make up for inadequate initial mapping, but with shorter read lengths there are plenty of reads which could map exactly, with no mismatches, at multiple locations in the genome. Even where these are within repeat regions, you can find that there are only a small number of locations where the read could map; using a paired end will then give you a mapped position where a single end would not.

Having said that, I'd actually argue that for mapping-type applications (e.g. ChIP-Seq) the benefit of paired end comes from the separation between ends. Repeated regions can stretch over many tens of bases, so increasing the length of a single-end read provides diminishing returns in terms of mapping efficiency. Using shorter paired-end reads with a greater separation between the ends will in many cases offer a greater chance of positioning a read, since one end may have escaped the repetitive region.

Having said all this, I personally would stick with single end reads for ChIP-Seq, mRNA seq and similar applications from a cost point of view. We do a lot of paired end sequencing at our site but normally it's for applications which absolutely require it, such as 4C.
Your idea of using paired end information is not flawed. My simple point is that a large fraction of repeats in humans occur locally, so that with 1kb variability in insert sizes (see ABI SOLiD) the other end won't help. With most new technologies moving to longer reads (>50bp), I don't foresee short reads for whole genome sequencing remaining for long.
Old 09-18-2009, 04:37 AM   #8
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

I'm a bit confused by the comment "with 1kb variability in insert sizes (see ABI SOLiD) the other end won't help". Perhaps it varies by who is preparing the library and how, but in the MoDIL paper their Illumina library had a fragment size distribution with a mean of 208 and a standard deviation of 13, which is quite a tight distribution.

While I would agree that a lot of a human-like genome is unlikely to resolve with paired end mapping and current read lengths, for specific genes which may be of high interest (especially in an array/solution capture approach) this information can be critical. For example, if there are retrotransposed duplicates of your gene of interest, paired reads may enable distinguishing the two. This would happen if either (a) one read in the pair maps into unique sequence (intron) for the original copy or (b) the distance between the reads in the pair is distinctive because it implies either crossing or not crossing an intron.
Old 09-18-2009, 04:46 AM   #9
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870

Quote:
Originally Posted by montera
Can anybody help me define single-end and paired-end alignments? Thanks a lot
When you create a library of fragments to sequence, some sequencing technologies offer the ability to sequence either from just one end of each fragment, or to get a pair of reads, one from each end. In many cases these paired-end reads won't meet in the middle, so you won't have the complete fragment sequence, but you have two sequences which you know should be separated by a fairly small distance in your reference sequence.

Some mapping tools can use the connection between the two tags from the same fragment to aid them in mapping the sequences to a reference.
Old 09-18-2009, 07:11 AM   #10
warrenemmett
Member
 
Location: South Africa

Join Date: Nov 2008
Posts: 23

Although there are definitely many more reads mapped for single-end, when I look at the RPKM values, the paired-end data produces values 100-300 higher than single-end (I just browsed through the top few genes).

Can anyone take a stab at what the reasons for this are? Although fewer reads are mapped, does it manage to give a more specific result in this case?
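One thing worth checking is how the normalisation behaves. The sketch below assumes the usual RPKM definition (reads per kilobase of transcript per million mapped reads); the read counts and gene length are invented purely for illustration and are not taken from this dataset.

```python
def rpkm(reads_on_gene: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return reads_on_gene * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: same count on the gene, different totals mapped.
single_end = rpkm(reads_on_gene=1000, gene_length_bp=2000, total_mapped_reads=10_000_000)
paired_end = rpkm(reads_on_gene=1000, gene_length_bp=2000, total_mapped_reads=5_000_000)

print(single_end)  # 50.0
print(paired_end)  # 100.0
```

Because the total number of mapped reads sits in the denominator, a run that maps fewer reads overall can report higher RPKM for the same gene without being any more specific. Whether that explains the difference seen here would depend on how the counts on each gene changed between the two runs.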
Old 09-18-2009, 08:52 AM   #11
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by krobison
I'm a bit confused by the comment. Perhaps it varies by who is preparing the library and how, but in the MoDIL paper their Illumina library had a fragment size distribution with a mean of 208 and a standard deviation of 13, which is quite a tight distribution.

While I would agree that a lot of a human-like genome is unlikely to resolve with paired end mapping and current read lengths, for specific genes which may be of high interest (especially in an array/solution capture approach) this information can be critical. For example, if there are retrotransposed duplicates of your gene of interest, paired reads may enable distinguishing the two. This would happen if either (a) one read in the pair maps into unique sequence (intron) for the original copy or (b) the distance between the reads in the pair is distinctive because it implies either crossing or not crossing an intron.
For a tight insert-size Illumina library, I would agree, paired ends could help. But for large insert-size ABI libraries, it is more ambiguous.

Nils
Old 09-18-2009, 10:47 AM   #12
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324

Nils, do you have any results showing this or are you just guessing? If you have one end derived from an Alu repeat, you would be comparing it to ~1 M copies for single end and perhaps 2-3 for paired ends with a 1 kb variability, so you should be able to find many more unique reads with paired ends.
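A back-of-the-envelope version of this argument, as a rough sketch only. The genome size is an assumed round figure, and the calculation treats repeat copies as spread uniformly along the genome, which is exactly the assumption under dispute in this thread (locally clustered repeats would give more candidates per window, such as the 2-3 mentioned above).

```python
def expected_copies_in_window(total_copies: float, window_bp: float, genome_bp: float) -> float:
    """Expected number of repeat copies in a window, assuming uniform spacing."""
    return total_copies * window_bp / genome_bp


ALU_COPIES = 1_000_000      # "~1 M copies", from the post above
WINDOW_BP = 1_000           # "1 kb variability", from the post above
GENOME_BP = 3_100_000_000   # assumption: approximate haploid human genome size

local = expected_copies_in_window(ALU_COPIES, WINDOW_BP, GENOME_BP)
print(f"single-end candidates: ~{ALU_COPIES:,}")
print(f"paired-end candidates within the window (uniform model): ~{local:.2f}")
```

Either way, the candidate set shrinks by several orders of magnitude once the other end is anchored; the open question is how often the few remaining local candidates are still ambiguous.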
Old 09-18-2009, 01:16 PM   #13
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by Chipper
Nils, do you have any results showing this or are you just guessing? If you have one end derived from an Alu repeat, you would be comparing it to ~1 M copies for single end and perhaps 2-3 for paired ends with a 1 kb variability, so you should be able to find many more unique reads with paired ends.
The data on which I am basing my results is from actually trying this strategy in a version of my own mapping tool BFAST. The discordance between my results and your expectation may come from the sensitive settings I use for mappings (up to 10% raw error). I am always open to incorporating this strategy as it is trivial to implement. Nonetheless, I have myself neither performed nor seen how this strategy increases false-mappings for those cases when this strategy is used (I assessed only sensitivity).

This might actually be a good time to rigorously put this debate to rest with some simulations. What if I create some paired end data (from Human) with error-rates coming from our latest Illumina runs and check the false-mapping rates if this strategy is used? I can take those reads for which one end does not map and see how many I can recover, assessing both the sensitivity and false mapping rates. What do you think?
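The bookkeeping for such a simulation could look something like the sketch below. It is not BFAST code; the function name `evaluate_rescue`, the read IDs and the position tolerance are all hypothetical, and it assumes the true origin of every simulated read is known.

```python
from typing import Dict, Optional, Tuple

Pos = Tuple[str, int]  # (chromosome, position)


def evaluate_rescue(truth: Dict[str, Pos],
                    rescued: Dict[str, Optional[Pos]],
                    tolerance: int = 5) -> Tuple[float, float]:
    """Return (sensitivity, false-mapping rate) for mate-rescued reads.

    truth   -- simulated origin of each read whose end failed to map alone
    rescued -- position assigned by the rescue strategy, or None if unplaced
    """
    correct = wrong = 0
    for read_id, true_pos in truth.items():
        call = rescued.get(read_id)
        if call is None:
            continue  # read was not recovered at all
        if call[0] == true_pos[0] and abs(call[1] - true_pos[1]) <= tolerance:
            correct += 1
        else:
            wrong += 1
    placed = correct + wrong
    sensitivity = correct / len(truth) if truth else 0.0
    false_mapping_rate = wrong / placed if placed else 0.0
    return sensitivity, false_mapping_rate


# Hypothetical toy data: four unmapped ends, two of them placed by rescue.
truth = {"r1": ("chr1", 100), "r2": ("chr1", 500), "r3": ("chr2", 50), "r4": ("chr3", 7)}
rescued = {"r1": ("chr1", 102), "r2": ("chr5", 500), "r3": None}
print(evaluate_rescue(truth, rescued))  # (0.25, 0.5)
```

Reporting both numbers side by side is the point: a strategy that recovers many reads but places half of them wrongly would look good on sensitivity alone.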
Old 03-20-2012, 11:10 PM   #14
anurag.gautam
Member
 
Location: India

Join Date: Oct 2010
Posts: 15

Hi,
I tried to map ~2 million Illumina reads to the Oryza sativa indica reference genome, with its reference GTF file, using different versions of TopHat: 1.1.4, 1.3.0, 1.3.1, 1.3.2, 1.3.3 and the current one, 1.4.1.
I used the default options just to check whether the mapping statistics are really affected. As a result, I got the following stats:

Version        Reads Used   Reads Mapped
TopHat 1.1.4   2,000,000    227,554
TopHat 1.3.0   2,000,000    230,817
TopHat 1.3.1   2,000,000    231,935
TopHat 1.3.2   2,000,000    4,517
TopHat 1.3.3   2,000,000    231,935
TopHat 1.4.1   2,000,000    137,724

I wanted to know why the number of reads mapped varies between versions even though the data is the same. Secondly, why is there such a drastic change in the mapping stats with versions 1.3.2 and 1.4.1 compared with the other versions? Can anybody please shed some light on this matter?
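To make the differences easier to compare, the counts above can be turned into mapping rates. This is just arithmetic on the posted numbers, assuming the mapped counts are read with lakh-style digit grouping (so "2,27,554" is taken as 227,554); the variable and function names are only for illustration.

```python
READS_USED = 2_000_000

# Reads mapped per TopHat version, as reported in the post above.
mapped_reads = {
    "1.1.4": 227_554,
    "1.3.0": 230_817,
    "1.3.1": 231_935,
    "1.3.2": 4_517,
    "1.3.3": 231_935,
    "1.4.1": 137_724,
}


def mapping_rate(mapped: int, used: int) -> float:
    """Percentage of input reads that were mapped."""
    return 100.0 * mapped / used


for version, mapped in mapped_reads.items():
    print(f"TopHat {version}: {mapping_rate(mapped, READS_USED):.2f}% mapped")
```

Seen this way, four of the six versions sit within a fraction of a percentage point of each other, and only 1.3.2 and 1.4.1 stand out.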