#1
Member
Location: Montreal, Canada Join Date: Feb 2011
Posts: 17
Some of our whole-genome libraries end up with small insert sizes (e.g. ~150 bp) for 2x100 bp sequencing on the Illumina HiSeq. I'm concerned about the effect this will have on variant calling.
Do you know how samtools and/or GATK deal with paired-end reads that overlap? I believe that samtools assumes the reads are independent. Therefore, if there is a PCR error in the middle of your insert, it may be counted as two supporting reads (the overlapping ends of a single read pair). With low-coverage sequencing data this could lead to a significant number of false variant calls. Is there a good way to deal with this? Many thanks for your suggestions.
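To put a number on it, here is a rough sketch of how much of each fragment ends up covered by both mates. The function name is made up for illustration, and it assumes equal-length, untrimmed reads and that the insert size is the full fragment length:

```python
def overlap_length(read_len, insert_size):
    """Bases of the fragment covered by both mates of a pair.

    Hypothetical helper: assumes both reads have the same length,
    no trimming, and insert_size is the full fragment length.
    """
    if insert_size <= 0 or read_len <= 0:
        return 0
    # Each mate covers at most the whole fragment, so the overlap
    # can never exceed the fragment length itself.
    return max(0, min(insert_size, 2 * read_len - insert_size))


overlap = overlap_length(100, 150)
fraction_double_covered = overlap / 150
print(overlap, round(fraction_double_covered, 2))  # prints: 50 0.33
```

So for 2x100 bp reads on a 150 bp insert, roughly a third of the fragment is read twice, and any error present in the molecule within that stretch shows up in both mates.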
#2
Member
Location: Montreal, Canada Join Date: Feb 2011
Posts: 17
I'm sure this happens to a lot of people doing sequencing. Does everyone just assume that it's not a problem?
Samtools WILL call variants with just 2 reads. Also, with low-coverage data we don't necessarily want to filter out variants seen in only 2 reads if other quality indicators are fine. What to do...
#3
Member
Location: US Join Date: Sep 2010
Posts: 14
Merge the overlapping reads. There are a number of tools that do this (e.g. FLASH).
#4
Senior Member
Location: Cambridge, UK Join Date: May 2010
Posts: 311
Quote:
This page is also quite useful: http://thegenomefactory.blogspot.co....aired-end.html
Best,
Dario
#5
Member
Location: Montreal, Canada Join Date: Feb 2011
Posts: 17
Wow, clipOverlap looks great. Exactly what I am looking for!
All the other tools I have seen (e.g. FLASH) try to remove overlap or combine reads straight from the FASTQ files. In the case where you have a good reference genome, e.g. human, this is sure to be much less accurate because it doesn't use the rest of the read to determine with confidence (e.g. by alignment) whether the reads overlap. I also need something that works on BAM files, since I will be getting them already aligned. Many thanks, Dario.
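As a rough illustration of why working from the alignment is simpler: once both mates have reference coordinates, finding the overlap is just interval arithmetic. The sketch below is not clipOverlap's actual algorithm. The function names are hypothetical, intervals are half-open reference coordinates, and indels and existing soft clips are ignored. It arbitrarily trims the forward mate, whereas a real tool might choose based on base quality.

```python
def overlap_interval(a, b):
    """Return the half-open reference interval shared by a and b, or None."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start < end else None


def clip_forward_mate(fwd, rev):
    """Trim the forward mate so it no longer overlaps the reverse mate.

    fwd and rev are (start, end) half-open aligned intervals.
    Assumption: gapless alignments, so reference span equals read span.
    """
    shared = overlap_interval(fwd, rev)
    if shared is None:
        return fwd, rev
    # Keep only the part of the forward mate left of the shared region.
    return (fwd[0], shared[0]), rev


# 2x100 bp pair on a 150 bp fragment starting at position 1000
print(clip_forward_mate((1000, 1100), (1050, 1150)))
# prints: ((1000, 1050), (1050, 1150))
```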
#6
Senior Member
Location: Boston area Join Date: Nov 2007
Posts: 747
It would seem clipOverlap is potentially throwing away information; the ideal tool would update the qualities when the two reads agree with each other in the overlapping region, as you now have greater confidence that the base was read correctly from the fragment.
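A minimal sketch of that idea, assuming the two reads' sequencing errors are independent so that Phred scores can simply be added on agreement and subtracted on disagreement. This is a common simplification, not the method of any particular tool, and `merge_base`, `merge_overlap` and the cap of 60 are made up for illustration.

```python
def merge_base(base1, qual1, base2, qual2, cap=60):
    """Combine two observations of the same fragment position.

    Assumption: independent sequencing errors, Phred-scaled qualities.
    """
    if base1 == base2:
        # Agreement: both reads would have to be wrong, so confidence rises.
        return base1, min(qual1 + qual2, cap)
    # Disagreement: keep the better-supported base with reduced confidence.
    if qual1 >= qual2:
        return base1, qual1 - qual2
    return base2, qual2 - qual1


def merge_overlap(seq1, quals1, seq2, quals2):
    """Merge two equal-length, already-registered overlapping segments."""
    merged = [merge_base(b1, q1, b2, q2)
              for b1, q1, b2, q2 in zip(seq1, quals1, seq2, quals2)]
    return "".join(b for b, _ in merged), [q for _, q in merged]


seq, quals = merge_overlap("ACGT", [30, 30, 20, 10], "ACGA", [30, 25, 20, 35])
print(seq, quals)  # prints: ACGA [60, 55, 40, 25]
```

Note this only reflects confidence that the base was read correctly from the fragment. A PCR error in the fragment itself would still agree in both reads.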
#7
Senior Member
Location: Halifax, Nova Scotia Join Date: Mar 2009
Posts: 381
Quote:
FLASH does this
#8
Senior Member
Location: Boston area Join Date: Nov 2007
Posts: 747
Yes, but as pointed out above, there is a risk with FLASH and similar tools (and I use FLASH routinely) that they misregister the reads on short imperfect repeats and artificially create an indel. With the genome sequence in hand, more information is available to merge the reads correctly.
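To make the repeat problem concrete, here is a toy sketch of reference-free overlap detection. It is not how FLASH scores overlaps; `candidate_overlaps` and the example sequences are invented for illustration, and the second read is assumed to be already reverse-complemented onto the same strand as the first.

```python
def candidate_overlaps(left, right, min_len=4, max_mismatches=0):
    """List every overlap length where the end of `left` matches the
    start of `right` within the allowed number of mismatches."""
    hits = []
    for n in range(min_len, min(len(left), len(right)) + 1):
        mismatches = sum(a != b for a, b in zip(left[-n:], right[:n]))
        if mismatches <= max_mismatches:
            hits.append(n)
    return hits


# Both reads end or start inside the same AC dinucleotide repeat.
left = "GGTTCAG" + "ACACACACAC"
right = "ACACACACAC" + "TTGGA"

overlaps = candidate_overlaps(left, right)
merged_lengths = [len(left) + len(right) - n for n in overlaps]
print(overlaps)        # prints: [4, 6, 8, 10]
print(merged_lengths)  # prints: [28, 26, 24, 22]
```

Four overlap lengths fit equally well here, and the merged fragment lengths they imply differ by multiples of the 2 bp repeat unit, so picking the wrong one adds or removes AC copies. Aligned coordinates for each mate's flanking sequence help pin down the right one.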
Tags: gatk, hiseq, insert size, low coverage, samtools