SEQanswers

Old 03-15-2010, 10:42 PM   #1
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30
Default Overlapping paired end - tophat

Hi,

I have a paired end (2x75) Illumina data set that might have overlap at the ends. The fragment size selected was 240 and after subtracting adapter/primer sequences, there was about 120 bp left, which generated about 30bp overlap at the ends.

My questions are:

1) Is this going to affect TopHat alignment? How should the -m option be specified?

2) When counting coverage, my intuition is that those overlapping bases might be counted twice, while they only appear in the library once. Is there any way to get around this?

3) Is this going to affect Cufflinks transcript assembly and quantitation?

Thanks for your help!
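
As a rough illustration of the arithmetic in the question above: the mate inner distance is the insert size minus the bases covered by the two reads, so a negative value means the mates overlap. The function name `inner_distance` is made up for this sketch; the numbers (about 120 bp of insert, 2x75 reads) are the ones given in the post.

```python
def inner_distance(insert_size, read_length):
    """Mate inner distance: insert size minus the bases covered by both reads.

    A negative result means the two mates overlap by that many bases.
    """
    return insert_size - 2 * read_length


# Numbers from the post: ~120 bp of insert left after adapters, 2x75 reads.
print(inner_distance(120, 75))  # -30, i.e. about 30 bp of overlap
```

In the TopHat manuals I am familiar with, this value is what gets passed as the mate inner distance (`-r/--mate-inner-dist`); check the manual for your version before relying on that.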
Old 03-15-2010, 11:39 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

I don't know how TopHat reacts to it, but I can already tell you that Bowtie won't like it, and hence TopHat will fail, too.

I'm currently working with a similar data set and noted that Bowtie fails to find an alignment for an overlapping paired read (and so does Eland). I ended up aligning the two ends separately and then stitching things together manually.

Of course, this is not an ideal solution.

Simon
Old 03-16-2010, 01:03 AM   #3
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203
Default

How did you stitch them? With samtools merge?
Old 03-16-2010, 06:51 AM   #4
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203
Default

Quote:
Originally Posted by wenhuang View Post
Hi,

I have a paired end (2x75) Illumina data set that might have overlap at the ends. The fragment size selected was 240 and after subtracting adapter/primer sequences, there was about 120 bp left, which generated about 30bp overlap at the ends.

Thanks for your help!
Why not convert your paired-end data into single-end?
Since there is a 30 bp overlap, the mates should assemble into a single read quite nicely.

So you end up with 120 bp SE data.
Old 03-16-2010, 07:05 AM   #5
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30
Default

My alignment did not seem to have much of a problem. Here is a sample of the first few alignments. It appears that the two reads were processed separately, but I am not sure about that.

HWUSI-EAS787_0001:5:70:1610:809#AAATAG 99 chr1 5312 255 81M = 5366 0 GCGAGGAAAGAAATGCACTAAGTAAAAAACTTAGTCATTTTTTAAAGAGAATTAAAATGAAGTCCAATTCCTTTGAGTTAC HGHHIHHHGHHHGGGHHHHHHHHIHHHGHFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHEHHFHEHGHHG NM:i:0
HWUSI-EAS787_0001:5:70:1610:809#AAATAG 147 chr1 5366 255 81M = 5312 0 AAATGAAGTCCAATTCCTTTGAGTTACAAATTTACAATCACTACTCAGTAATTAAAACTATTCAGTTATAGTGAACTGATT IHFHHIHBGHHHHHGHHFEHHHHHHHHHHHHHHHHHHHHEHHGHHHHHHHHHHHHGGHHHHHHHHHHIHHHHHHGHHHHHH NM:i:0


HWUSI-EAS787_0001:5:30:1504:1763#TTGTCG 163 chr1 5822 255 81M = 5860 0 CCAGAGCCCACAGCTTACTTTTGGTGGTACCCATCCTAAGGGTCTGGGCAAACATATAACGATAAATGTCCATCATTATAA HHGHHGGFHHHHHHHHHEHHHHHHHHHHHEHHGHDEGHHHHHBBBGGG7FHH2HEHBHH0FHEFHC+?6><CC-CEDDBA@ NM:i:0
HWUSI-EAS787_0001:5:30:1504:1763#TTGTCG 83 chr1 5860 255 81M = 5822 0 AGGGTCTGGGCAAACATATAACGATAAATGTCCATCATTATAATATCACACAGAGTAGTTTCACTGCCCTGAAACTCTTTT G@CBFHE?G=HHGIHHHHGHGHBHGHHHEGHDHHGHHFFHHHHHHHHHHGHHGHGFHCHHGHHHHFHHHHHHHHHHHHHHH NM:i:0
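
The overlap between the mates in the records above can be read off the POS and CIGAR fields. Here is a minimal sketch of that calculation; the helper names `ref_span` and `pair_overlap` are hypothetical, and it assumes unspliced alignments (for a CIGAR containing `N`, the span would include intronic positions).

```python
import re


def ref_span(cigar):
    """Number of reference bases consumed by a CIGAR string (ops M, D, N, =, X)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")


def pair_overlap(pos1, cigar1, pos2, cigar2):
    """Overlap in reference bases between two mates (0 if they do not overlap)."""
    end1 = pos1 + ref_span(cigar1) - 1  # SAM POS is 1-based, so the end is inclusive
    end2 = pos2 + ref_span(cigar2) - 1
    return max(0, min(end1, end2) - max(pos1, pos2) + 1)


# The two pairs posted above (POS and CIGAR of each mate).
print(pair_overlap(5312, "81M", 5366, "81M"))  # 27
print(pair_overlap(5822, "81M", 5860, "81M"))  # 43
```

Both results agree with the sequences themselves: the last 27 (respectively 43) bases of the first mate are the first 27 (respectively 43) bases of the second.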



Old 03-16-2010, 07:09 AM   #6
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30
Default

I think this is a decent solution. Many of my reads suffered from bad quality at the end, though. Can you recommend a tool that might do this job? Thanks!

Quote:
Originally Posted by KevinLam View Post
Why not convert your paired end data into single end?
Since there is a 30 bp overlap. they should assemble into a single read quite nicely.

so you end up with a 120 bp SE data.
Old 03-16-2010, 07:47 AM   #7
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203
Default

Quote:
Originally Posted by wenhuang View Post
I think this is a decent solution. Many of my reads suffered from bad quality at the end though. Can you recommend a type of tools that might do this job ? Thanks!
The only tool I know that can do this is phrap, but I am not sure how long it would take when applied to so many reads.
Old 03-16-2010, 09:56 AM   #8
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

As of TopHat 1.0.13, you should be able to specify a negative inner distance of -30. TopHat does map the reads independently, and has a different algorithm from Bowtie for handling the ends. The coverage.wig file displays depth of read coverage, not depth of physical coverage, so those bases will be double counted, as you suggest. However, Cufflinks operates at the fragment level, not the read level, and so should do the right thing here.
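
To make the read-coverage versus physical-coverage distinction concrete, here is a small sketch. It is not how TopHat builds coverage.wig or how Cufflinks counts fragments; `read_depth` and `fragment_depth` are hypothetical helpers, and the intervals are 1-based inclusive reference coordinates of unspliced mates.

```python
from collections import Counter


def read_depth(pairs):
    """Per-base depth counting every read, so overlapping mates count twice."""
    depth = Counter()
    for mate1, mate2 in pairs:
        for start, end in (mate1, mate2):
            for pos in range(start, end + 1):
                depth[pos] += 1
    return depth


def fragment_depth(pairs):
    """Per-base depth counting each fragment once over its whole span."""
    depth = Counter()
    for (s1, e1), (s2, e2) in pairs:
        for pos in range(min(s1, s2), max(e1, e2) + 1):
            depth[pos] += 1
    return depth


# One pair from earlier in the thread: mates at 5312-5392 and 5366-5446.
pairs = [((5312, 5392), (5366, 5446))]
print(read_depth(pairs)[5370], fragment_depth(pairs)[5370])  # 2 1
```

Position 5370 lies in the overlap, so it gets a read depth of 2 but a fragment depth of 1.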
Old 03-16-2010, 11:27 AM   #9
ecabot
Junior Member
 
Location: Madison, WI

Join Date: Jul 2008
Posts: 6
Default

Here are more details about Wen's run, which was 2x75.

The minimum fragment size, including flanking adapters, is 150 bp. Thus fragments with the smallest insert could be diagrammed like this, with 32 bases of overlapping cDNA:


[adapter:59][cDNA 32][adapter:59]
o~~~~~~~~~~~> (with 43bp of adapter)
<~~~~~~~~~~~~o


I am assuming, however, that reads this short would fail to map because of the high proportion of adapter-derived sequence embedded in the reads.


These considerations lead me to the following questions:


1) Does a negative inner distance of, for example, -30 reflect an expected mean of 30 bp of overlap, or does it specify a maximum amount of overlap?

After all, most of Wen's reads don't overlap, and the overlap could be as high as a full 75 bp for a 193 bp fragment. If I were to calculate the actual mean inner distance, taking overlaps as negative distances, the overall mean might well turn out to be positive.

2) If we were to trim the adapters, this would invariably lead to a distribution of read lengths rather than a uniform 75 bases. Can Bowtie and TopHat deal with unequal read lengths, or is this likely to be a problem?
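
The point in question 1, that a library can contain overlapping pairs and still have a positive mean inner distance, can be sketched with a few numbers. The insert sizes below are invented purely for illustration (they are not measurements from Wen's library), and `inner_distances` is a hypothetical helper.

```python
def inner_distances(insert_sizes, read_length):
    """Inner distance per fragment; negative values are overlaps."""
    return [size - 2 * read_length for size in insert_sizes]


# Hypothetical adapter-free insert sizes for a 2x75 run.
inserts = [100, 120, 180, 220, 260]
dists = inner_distances(inserts, 75)
mean = sum(dists) / len(dists)
overlapping = sum(1 for d in dists if d < 0)

print(dists)        # [-50, -30, 30, 70, 110]
print(mean)         # 26.0
print(overlapping)  # 2
```

Two of the five fragments overlap, yet the mean inner distance is positive, which is why the mean and the maximum overlap are different things to ask about.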
Old 03-16-2010, 11:29 AM   #10
ecabot
Junior Member
 
Location: Madison, WI

Join Date: Jul 2008
Posts: 6
Default

Here is how the diagram from my previous posting should look (with dots replacing whitespace). Sorry for the confusion.

[adapter:59][cDNA 32][adapter:59]
.............................o~~~~~~~~~~~> (with 43bp of adapter)
...........<~~~~~~~~~~~~o
Old 03-18-2010, 12:01 PM   #11
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

In my case, Bowtie 0.12.3 (and also BWA) seems to work well for overlapping paired-end reads. I have 2x59 reads, and I found that the ISIZE for many records is less than 118 while the FLAG field indicates they are mapped as proper pairs.
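
For anyone wanting to repeat this check, here is a minimal sketch that decodes the SAM FLAG bits and tests whether a pair's ISIZE is shorter than two read lengths. The bit meanings come from the SAM format; the names `decode_flag` and `mates_overlap` are just illustrative.

```python
# Paired-end related bits of the SAM FLAG field.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag):
    """Names of the paired-end bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def mates_overlap(isize, read_length):
    """True if the reported insert size is shorter than two full reads.

    An ISIZE of 0 means the aligner did not report one, so it counts as unknown.
    """
    return 0 < abs(isize) < 2 * read_length


print(decode_flag(99))        # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(decode_flag(147))       # ['paired', 'proper_pair', 'reverse', 'second_in_pair']
print(mates_overlap(100, 59)) # True
```

Flags 99 and 147 are the ones on the first pair posted earlier in this thread; both have the proper-pair bit set.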
Old 03-18-2010, 12:10 PM   #12
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

TopHat and Bowtie use completely different procedures to handle paired ends, and their policies are not the same. TopHat maps the left and right reads independently, and recent versions should have no trouble with paired end libraries with negative inner distances and overlapping reads. With TopHat 1.0.13 and Cufflinks 0.8.0, I have processed an RNA-Seq library size selected to 100bp and sequenced with 2x76bp GAII. The mean inner distance in this case is negative, and the TopHat/Cufflinks stack produced fine results.

To answer a previous question - TopHat will not handle reads of different lengths gracefully, so if you make "virtual" long reads from overlapping mates, make sure to trim the products down to a uniform length.
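
Trimming merged reads down to one length could look something like the sketch below. It is only one way to do it: it assumes the reads are already parsed into (name, sequence, quality) tuples, cuts from the 3' end, and drops anything shorter than the target. The name `trim_to_length` is hypothetical.

```python
def trim_to_length(records, length):
    """Yield (name, seq, qual) cut to `length`; skip reads shorter than that."""
    for name, seq, qual in records:
        if len(seq) >= length:
            yield name, seq[:length], qual[:length]


merged = [
    ("read1", "ACGTACGT", "IIIIHHHH"),
    ("read2", "ACG", "III"),  # too short, will be dropped
]
for record in trim_to_length(merged, 5):
    print(record)  # ('read1', 'ACGTA', 'IIIIH')
```

Cutting from the 3' end is an arbitrary choice here; trimming the lower-quality end would be just as reasonable.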
Old 06-15-2010, 05:47 AM   #13
ACTGangster
Junior Member
 
Location: Gainesville, FL

Join Date: Sep 2009
Posts: 8
Default Another possible solution

I had to edit this post. I wrote a program that assembles overlapping paired ends from Illumina. It used to be public, but now it's private because I want to write a paper on it.

If you want a copy, you can e-mail me and I'll send it to you.

I tested it on 1.5 million reads that overlap by ~25 bp, and it assembled about 78% into larger contigs, which can then be de novo assembled. In the overlapping region, it chooses the nucleotide with the best quality score (if there is a discrepancy). If there is a discrepancy and the quality scores are the same, it chooses the appropriate ambiguous nucleotide.
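
The merging rule described above (higher quality wins, equal quality gives an ambiguity code) can be sketched roughly as follows. This is not the program's actual source, only an illustration of the description; the function names, the `min_overlap` and `max_mismatch_fraction` parameters, and the longest-overlap-first search are all assumptions of this sketch.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

# IUPAC codes for the six possible two-base ambiguities.
AMBIGUITY = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def consensus_base(b1, q1, b2, q2):
    """Pick the higher-quality base; on a quality tie use an ambiguity code."""
    if b1 == b2:
        return b1
    if q1 != q2:
        return b1 if q1 > q2 else b2
    return AMBIGUITY.get(frozenset((b1, b2)), "N")


def merge_pair(seq1, qual1, seq2, qual2, min_overlap=10, max_mismatch_fraction=0.1):
    """Merge mate 1 with reverse-complemented mate 2; return None if no overlap.

    Qualities are sequences of comparable scores (e.g. lists of ints).
    min_overlap must be at least 1.
    """
    rc2, rq2 = reverse_complement(seq2), qual2[::-1]
    # Try the longest overlap first: suffix of mate 1 against prefix of rc(mate 2).
    for olen in range(min(len(seq1), len(rc2)), min_overlap - 1, -1):
        a, b = seq1[-olen:], rc2[:olen]
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        if mismatches <= max_mismatch_fraction * olen:
            overlap = "".join(
                consensus_base(x, p, y, q)
                for x, p, y, q in zip(a, qual1[-olen:], b, rq2[:olen])
            )
            return seq1[:-olen] + overlap + rc2[olen:]
    return None
```

Taking the longest acceptable overlap first is the simplest choice; in repetitive sequence a scoring scheme that compares candidate overlaps would probably be safer.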

Last edited by ACTGangster; 07-24-2010 at 06:26 PM. Reason: makebettered
Old 07-29-2010, 01:35 PM   #14
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

I uploaded a Python script I wrote for this to SVAR:
http://code.google.com/p/standardize.../mergePairs.py
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
Old 07-29-2010, 01:39 PM   #15
ACTGangster
Junior Member
 
Location: Gainesville, FL

Join Date: Sep 2009
Posts: 8
Default stitch

I open-sourced my Stitch program as I do not plan on writing a paper on it specifically.

http://github.com/audy/stitch

It runs on as many cores as you have. I did 20 million reads in 40 minutes on a 16-core Mac Pro.
Old 02-22-2011, 09:08 AM   #16
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Default stitch error

I am trying to use stitch but got the error below. Any suggestions?

$ stitch
Traceback (most recent call last):
  File "/usr/bin/stitch", line 7, in ?
    sys.exit(
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 318, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 2221, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1954, in load
    entry = __import__(self.module_name, globals(),globals(), ['__name__'])
  File "/usr/lib64/python2.4/site-packages/PIL/__init__.py", line 1, in ?
    #
  File "build/bdist.linux-x86_64/egg/stitch/stitch.py", line 13, in ?
ImportError: No module named multiprocessing
Old 02-22-2011, 09:21 AM   #17
ACTGangster
Junior Member
 
Location: Gainesville, FL

Join Date: Sep 2009
Posts: 8
Default

ImportError: No module named multiprocessing

What version of python are you using? What operating system?
Old 02-22-2011, 09:27 AM   #18
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Default

@ACTGangster
Using Python 2.4 on CentOS 5.5.
Old 02-22-2011, 09:35 AM   #19
ACTGangster
Junior Member
 
Location: Gainesville, FL

Join Date: Sep 2009
Posts: 8
Default

You need Python 2.6 or greater; the multiprocessing module was only added to the standard library in 2.6.
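
A quick way to check an interpreter before installing is a version test like the one below. This is a generic sketch, not something shipped with stitch, and `has_multiprocessing` is a made-up helper name.

```python
import sys


def has_multiprocessing(version_info=sys.version_info):
    """True if this Python version ships the multiprocessing module (2.6+)."""
    return tuple(version_info[:2]) >= (2, 6)


if not has_multiprocessing():
    sys.stderr.write("This interpreter is older than Python 2.6; "
                     "multiprocessing is not available.\n")
```

Running it with the interpreter you plan to install under (for example the `python2.4` from the traceback) shows right away whether the import will fail.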
Old 02-22-2011, 09:39 AM   #20
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Default

Another error, this time with Python 2.7:
$ sudo python2.7 setup.py install
Traceback (most recent call last):
  File "setup.py", line 9, in <module>
    setup(
NameError: name 'setup' is not defined