SEQanswers

03-21-2013, 05:20 AM   #1
Jeremy37
Member

Location: Montreal, Canada

Join Date: Feb 2011
Posts: 17
Dealing with overlapping read pairs

Some of our whole-genome libraries end up with small insert sizes (e.g. ~150 bp) for 2x100 bp sequencing on an Illumina HiSeq. I'm concerned about the effect this will have on variant calling.

Do you know how samtools and/or GATK deal with paired-end reads that overlap? I believe that samtools assumes the reads are independent. Therefore, if there is a PCR error in the middle of your insert, it may be counted as two supporting reads (the overlapping ends of a single read pair), even though both come from the same fragment. With low-coverage sequencing data this could lead to a significant number of false variant calls.
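
To put rough numbers on the concern above, here is a minimal Python sketch of how many bases both mates of a pair cover. It assumes that "insert size" means the full fragment length, that both reads have the same length, and that there is no trimming and no indels. The function name is made up for illustration.

```python
def expected_overlap(read_length: int, fragment_length: int) -> int:
    """Number of fragment bases sequenced by both mates of a pair.

    Assumes equal-length reads sequenced inward from both fragment ends.
    If the fragment is shorter than one read, both reads span the whole
    fragment, so the overlap is capped at the fragment length.
    """
    overlap = min(fragment_length, 2 * read_length - fragment_length)
    return max(0, overlap)


# 2x100 bp reads on a ~150 bp fragment, as in the post above
print(expected_overlap(100, 150))  # 50
# A 300 bp fragment leaves a gap between the mates, so nothing is double-counted
print(expected_overlap(100, 300))  # 0
```

Under these assumptions, a 150 bp fragment has its middle 50 bp read twice, so an error introduced before sequencing at any of those positions shows up in both mates.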

Is there a good way to deal with this?
Many thanks for your suggestions.

03-21-2013, 09:48 AM   #2
Jeremy37

I'm sure this happens to a lot of people doing sequencing. Does everyone just assume that it's not a problem?
Samtools WILL call variants with just 2 reads. Also, with low-coverage data we don't necessarily want to filter out variants seen in 2 reads if other quality indicators are fine. What to do...

03-21-2013, 01:58 PM   #3
MeganS
Member

Location: US

Join Date: Sep 2010
Posts: 14

Merge the overlapping reads. There are a number of tools that do this (e.g. FLASH).

03-22-2013, 06:33 AM   #4
dariober
Senior Member

Location: Cambridge, UK

Join Date: May 2010
Posts: 311

Quote:
Originally Posted by Jeremy37
I'm sure this happens to a lot of people doing sequencing. Does everyone just assume that it's not a problem?
Samtools WILL call variants with just 2 reads. Also, with low-coverage data we don't necessarily want to filter out variants seen in 2 reads if other quality indicators are fine. What to do...

I've been using clipOverlap on the aligned BAM files. Just make sure the names of the two reads in each pair are identical (i.e. without the /1 or /2 suffix that some aligners add to the read names).
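
As a small illustration of the read-name point, here is a hedged Python sketch that strips a trailing /1 or /2 so both mates end up with the same name. The helper name and the example read name are hypothetical, and this only shows the string handling; how you apply it to your FASTQ or BAM records depends on your pipeline.

```python
import re

# Matches a trailing "/1" or "/2" mate suffix at the very end of a read name
_MATE_SUFFIX = re.compile(r"/[12]$")


def strip_mate_suffix(read_name: str) -> str:
    """Return the read name without a trailing /1 or /2 mate suffix."""
    return _MATE_SUFFIX.sub("", read_name)


# Hypothetical read names for the two mates of one pair
mate1 = "HWI-ST123:1:1101:1234:5678/1"
mate2 = "HWI-ST123:1:1101:1234:5678/2"
print(strip_mate_suffix(mate1) == strip_mate_suffix(mate2))  # True
```

Names that do not end in /1 or /2 are returned unchanged.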

This page is also quite useful http://thegenomefactory.blogspot.co....aired-end.html

Best
Dario

03-22-2013, 09:09 AM   #5
Jeremy37

Wow, clipOverlap looks great. Exactly what I am looking for!

All the other tools I have seen (e.g. FLASH) try to remove the overlap or combine the reads straight from the fastq files. When you have a good reference genome, e.g. human, this is likely to be much less accurate, because it doesn't use the rest of the read (e.g. via alignment) to determine with confidence whether the reads overlap.
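
To make the alignment-based idea concrete, here is a minimal Python sketch that computes the overlap of two mates from their aligned reference intervals. It assumes both mates map to the same chromosome, uses 0-based half-open coordinates, and ignores indels, soft clipping and spliced alignments. The function name is made up for illustration and this is not how any particular tool implements it.

```python
def aligned_overlap(start1: int, end1: int, start2: int, end2: int) -> int:
    """Reference bases covered by both mates.

    Coordinates are 0-based, half-open [start, end) intervals on the same
    chromosome. Returns 0 when the mates do not overlap.
    """
    return max(0, min(end1, end2) - max(start1, start2))


# Two 100 bp mates from a 150 bp fragment starting at position 1000
print(aligned_overlap(1000, 1100, 1050, 1150))  # 50
```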
I also need something that works on BAM files, since I will be getting them already aligned.
Many thanks Dario.

03-23-2013, 07:35 PM   #6
krobison
Senior Member

Location: Boston area

Join Date: Nov 2007
Posts: 747

It would seem clipOverlap is potentially throwing away information; the ideal tool would update the qualities when the two reads agree with each other in the overlapping region, as you now have greater confidence that the base was read correctly from the fragment.
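
One simple way to picture the quality update described above is the following Python sketch. It is only a common heuristic, not what clipOverlap or FLASH actually do: when the two mates agree, the Phred scores are added (which roughly treats the two reads as independent observations of the base), and when they disagree, the higher-quality base is kept with a reduced score. The function name and the cap of 60 are assumptions for illustration.

```python
def merge_overlap_base(base1: str, qual1: int, base2: str, qual2: int,
                       max_qual: int = 60) -> tuple[str, int]:
    """Combine the two mates' calls at one overlapping position.

    Agreement: keep the base and add the Phred scores, capped at max_qual.
    Disagreement: keep the higher-quality base with the score difference.
    """
    if base1 == base2:
        return base1, min(qual1 + qual2, max_qual)
    if qual1 >= qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1


print(merge_overlap_base("A", 30, "A", 25))  # ('A', 55)
print(merge_overlap_base("A", 30, "C", 10))  # ('A', 20)
```

Note that this raises confidence that the base was read correctly from the fragment; it says nothing about errors already present in the fragment (e.g. from PCR), which both mates would share.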

03-24-2013, 01:33 AM   #7
JackieBadger
Senior Member

Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

Quote:
Originally Posted by krobison
It would seem clipOverlap is potentially throwing away information; the ideal tool would update the qualities when the two reads agree with each other in the overlapping region, as you now have greater confidence that the base was read correctly from the fragment.

FLASH does this

03-24-2013, 04:03 PM   #8
krobison

Quote:
Originally Posted by JackieBadger
FLASH does this

Yes, but as pointed out above, there is a risk with FLASH and similar tools (and I use FLASH routinely) of mis-registering the reads on short imperfect repeats and artificially creating an indel. With the genome sequence in hand, more information is available to merge the reads correctly.
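
The registration problem can be seen with a toy Python sketch. It lists every overlap length at which the end of read 1 matches the start of the reverse-complemented read 2. This is a deliberately naive exact-match search, not FLASH's actual scoring, and the sequences and function name are made up for illustration.

```python
def candidate_overlaps(read1: str, read2_rc: str, min_overlap: int = 4,
                       max_mismatches: int = 0) -> list[int]:
    """Overlap lengths k where the last k bases of read1 match the
    first k bases of read2_rc with at most max_mismatches mismatches."""
    hits = []
    for k in range(min_overlap, min(len(read1), len(read2_rc)) + 1):
        suffix = read1[-k:]
        prefix = read2_rc[:k]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            hits.append(k)
    return hits


# Toy reads whose overlap falls inside a CA dinucleotide repeat
read1 = "GATTACACACACA"
read2_rc = "CACACACAGGTT"
print(candidate_overlaps(read1, read2_rc))  # [4, 6, 8]
```

All three overlaps are perfect matches, but they give merged fragments of 21, 19 and 17 bp, so picking the wrong one adds or removes whole repeat units. An aligner with the reference in hand can tell which registration is right; a merger working from the reads alone cannot.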

Tags
gatk, hiseq, insert size, low coverage, samtools

All times are GMT -8.