
  • Dealing with overlapping read pairs

    Some of our whole-genome libraries end up with small insert sizes (e.g. ~150 bp) for 2x100 bp sequencing on the Illumina HiSeq, so the two reads of a pair overlap. I'm concerned about the effect this will have on variant calling.

    Do you know how samtools and/or GATK deal with paired-end reads that overlap? I believe that samtools assumes the reads are independent. Therefore, if there is a PCR error in the middle of your insert, it may be counted as two independent observations (the overlapping ends of one read pair) even though both come from the same fragment. With low-coverage sequencing data this could lead to a significant number of false variants.

    Is there a good way to deal with this?
    Many thanks for your suggestions.
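
To put a number on the overlap described above, here is a minimal sketch. It assumes "insert size" means the full fragment length and that both reads are sequenced to full length; the function name `overlap_length` is made up for illustration and does not come from any tool mentioned in this thread.

```python
def overlap_length(read_length: int, fragment_length: int) -> int:
    """Number of fragment bases covered by both mates of a read pair.

    Assumes both mates are sequenced to full length and that
    fragment_length is the full fragment (no adapters).
    """
    if read_length <= 0 or fragment_length <= 0:
        raise ValueError("lengths must be positive")
    # Bases sequenced in total minus bases available in the fragment.
    covered_twice = 2 * read_length - fragment_length
    # A fragment shorter than one read is covered twice over its whole length.
    return max(0, min(covered_twice, fragment_length))


# The case from the post: 2x100 bp reads on a ~150 bp fragment.
print(overlap_length(100, 150))
```

Under these assumptions, about a third of each fragment is read twice, and any error already present in the fragment (such as a PCR error) shows up in both mates.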

  • #2
    I'm sure this happens to a lot of people doing sequencing. Does everyone just assume that it's not a problem?
    Samtools WILL call variants with just 2 reads. Also, with low-coverage data we don't necessarily want to filter out variants seen in 2 reads if other quality indicators are fine. What to do...



    • #3
      Merge the overlapping reads. There are a number of tools that do this (e.g. FLASH).



      • #4
        Originally posted by Jeremy37
        I'm sure this happens to a lot of people doing sequencing. Does everyone just assume that it's not a problem?
        Samtools WILL call variants with just 2 reads. Also, with low-coverage data we don't necessarily want to filter out variants seen in 2 reads if other quality indicators are fine. What to do...
        I've been using clipOverlap on the aligned BAM files. Just make sure the names of the two reads in each pair are identical (i.e. without the /1 or /2 suffix that some aligners add to the read names).

        This page is also quite useful http://thegenomefactory.blogspot.co....aired-end.html

        Best
        Dario
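
The read-name point above can be illustrated with a small sketch that strips a trailing /1 or /2 so both mates share one name. This is plain string handling, not part of clipOverlap; the helper name `strip_mate_suffix` and the example read names are hypothetical.

```python
import re

# Matches a trailing "/1" or "/2" mate marker at the very end of a read name.
_MATE_SUFFIX = re.compile(r"/[12]$")


def strip_mate_suffix(read_name: str) -> str:
    """Return the read name without a trailing /1 or /2, if present."""
    return _MATE_SUFFIX.sub("", read_name)


# Hypothetical read names, for illustration only.
for name in ("HWI-ST1:1:1101:1234:5678/1", "HWI-ST1:1:1101:1234:5678/2"):
    print(strip_mate_suffix(name))
```

Names without such a suffix pass through unchanged, so the function is safe to apply to every record.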



        • #5
          Wow, clipOverlap looks great. Exactly what I am looking for!

          All the other tools I have seen (e.g. FLASH) try to remove overlap or combine reads straight from the FASTQ files. When you have a good reference genome, e.g. human, I'd expect this to be less accurate, because it doesn't use the rest of the read to determine with confidence (e.g. by alignment) whether the reads overlap.
          I also need something that works on BAM files, since I will be getting them already aligned.
          Many thanks Dario.



          • #6
            It would seem clipOverlap is potentially throwing away information; the ideal tool would update the qualities when the two reads agree with each other in the overlapping region, as you now have greater confidence that the base was read correctly from the fragment.
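
The quality update suggested above could look roughly like the sketch below. It is one simple scheme, not what clipOverlap or FLASH actually implement: it assumes the two mates' sequencing errors are independent, adds Phred scores when the bases agree, and subtracts them when they disagree. The function name `merge_overlap_base` and the cap `max_qual=60` are assumptions made for illustration.

```python
def merge_overlap_base(base1: str, qual1: int, base2: str, qual2: int,
                       max_qual: int = 60) -> tuple[str, int]:
    """Combine one overlapping position from two mates into (base, Phred quality).

    Assumes independent sequencing errors in the two mates. Errors already in
    the fragment (e.g. PCR errors) are shared by both mates and are not
    corrected by this.
    """
    if base1 == base2:
        # Agreement: both reads support the base, so confidence goes up.
        return base1, min(qual1 + qual2, max_qual)
    if qual1 > qual2:
        # Disagreement: keep the better-supported base with reduced confidence.
        return base1, qual1 - qual2
    if qual2 > qual1:
        return base2, qual2 - qual1
    # Equal-quality disagreement: no basis for choosing either base.
    return "N", 0


print(merge_overlap_base("A", 30, "A", 30))
print(merge_overlap_base("A", 30, "C", 10))
```

After such a merge, each fragment contributes a single observation per position, which addresses the double-counting concern raised at the start of the thread.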



            • #7
              Originally posted by krobison
              It would seem clipOverlap is potentially throwing away information; the ideal tool would update the qualities when the two reads agree with each other in the overlapping region, as you now have greater confidence that the base was read correctly from the fragment.

              FLASH does this



              • #8
                Originally posted by JackieBadger
                FLASH does this
                Yes, but as pointed out above, there is a risk with FLASH and similar tools (and I use FLASH routinely) that they misregister the two reads at short imperfect repeats and artificially create an indel. With the genome sequence in hand, more information is available to merge the reads correctly.
