  • merge or do not merge overlapping paired-end reads?

    Dear all,

    I would like to discuss with you whether it is truly meaningful to merge overlapping paired-end (PE) reads from Illumina exome or whole-genome sequencing into a single (SE) read. Your first impulse is probably "yes, of course!", and I think that is what people usually do if they have overlapping PE data, but I'd like to invite you to rethink this concept with me.

    In particular, I want to raise the question of whether it is valid to count overlapping PE reads twice in the overlapping region. This depends on whether you consider these reads to be independent. Naturally they come from the same amplification, i.e. the same cluster, but the sequencing of the two PE reads is independent. A counterargument is that we sequenced the same fragment twice: if something in the DNA fragment "triggers" a sequencing error in one read, it is likely to occur in the other read as well, so we would count one sequencing error twice. On the other hand, if there is a true SNP in the fragment, we will correctly sequence it twice and have twice the read support for this SNP. That second argument only holds if you consider the two reads to be independent of each other, and I am not sure that is correct.

    Below I summarize the arguments pro and contra merging again. I'd appreciate your thoughts on the matter.

    PRO merging
    - merging overlapping PE reads gives us one longer SE read, and longer reads are better
    - if the DNA fragment "triggers" a sequencing error, both reads will have it. If we merge, there will only be one read with the error.
    - merging gives us higher base confidence in the overlapping region.
    - it is not valid to count reads coming from the same DNA fragment twice (not sure if this is correct).

    CONTRA merging
    - if the two paired reads are independent, merging will result in an artificial reduction of coverage, i.e. we throw away data.
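The "count twice" question can be made concrete with a small sketch. It compares depth at one reference position when every read counts separately versus when each fragment counts once. The function name `depth_at`, the interval layout, and the coordinates are all made up for illustration; they are not taken from any particular tool.

```python
def depth_at(pos, pairs, count_fragments):
    """Depth at reference position `pos`.

    `pairs` is a list of ((start1, end1), (start2, end2)) half-open
    reference intervals for the two mates of each fragment (hypothetical
    layout). With count_fragments=True a fragment contributes at most 1,
    even if both mates cover `pos`.
    """
    depth = 0
    for (s1, e1), (s2, e2) in pairs:
        hits = int(s1 <= pos < e1) + int(s2 <= pos < e2)
        depth += min(hits, 1) if count_fragments else hits
    return depth


# Two fragments: the mates of the first overlap at 150-200,
# the mates of the second do not overlap at all.
pairs = [((100, 200), (150, 250)), ((120, 220), (230, 330))]

print(depth_at(160, pairs, count_fragments=False))  # 3 (per-read counting)
print(depth_at(160, pairs, count_fragments=True))   # 2 (per-fragment counting)
```

Outside the overlap the two ways of counting agree, so the disagreement between the PRO and CONTRA arguments only concerns positions covered by both mates of the same fragment.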


    Finally, for my specific exome sequencing projects, these thoughts led me to the question of whether I should try to avoid overlapping PE reads, i.e. change my study design. I'd be happy if you contributed your thoughts on this matter as well. Thank you very much.
    http://seqanswers.com/forums/showthread.php?t=61370
    Last edited by evakoe; 07-20-2015, 03:05 AM. Reason: additional argument, added link to another thread

  • #2
    I think most people's first reaction isn't "yes, of course", but "no, why bother". Most tools are already able to handle paired-end data, so it's unclear what you're really gaining from merging things. The situation is different if you're doing assembly, since then you can merge things as a way of correcting errors.

    Regarding counting overlapping regions twice, this is rarely a valid thing to do. What tools such as samtools do is look at overlapping regions and modify the Phred scores according to whether the overlapping sequences agree or not. If the sequences agree, the Phred score is increased. If they disagree, the scores are decreased accordingly. I don't know offhand how GATK deals with that, but I wouldn't be surprised if it does something similar.
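The agree/disagree adjustment described above can be sketched for a single overlapping position. This is a deliberately simplified rule chosen for illustration (add the scores on agreement up to a cap, keep the stronger call with the score difference on disagreement); it is an assumption, not the actual samtools or GATK implementation, and `combine_overlap` and the cap of 60 are hypothetical.

```python
def combine_overlap(base_a, qual_a, base_b, qual_b, cap=60):
    """Combine two mates' calls at one overlapping position.

    Returns a single (base, phred) pair so the fragment is counted once.
    Simplified illustrative rule, not any tool's real algorithm:
      - agreement: the evidence reinforces, so add the scores (capped)
      - disagreement: keep the higher-quality base, penalised by the
        quality of the conflicting call
      - disagreement with equal quality: no call ('N', 0)
    """
    if base_a == base_b:
        return base_a, min(qual_a + qual_b, cap)
    if qual_a == qual_b:
        return "N", 0
    if qual_a > qual_b:
        return base_a, qual_a - qual_b
    return base_b, qual_b - qual_a


print(combine_overlap("A", 30, "A", 20))  # ('A', 50)
print(combine_overlap("A", 30, "C", 10))  # ('A', 20)
```

Either way, the overlapping position ends up contributing one observation per fragment rather than two.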



    • #3
      Whether you merge depends on what you are doing. As long as the merger is accurate, it's often helpful for assembly, as it yields longer reads which are generally more useful (allowing a longer kmer length or longer overlaps / fewer sequences for string graphs). It also can reduce error rates, if the quality scores are accurate; both reads will be assigned the higher-confidence base call where they differ.
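As a rough sketch of what a merger does, the function below joins two mates into one longer read and takes the higher-quality base wherever they differ in the overlap. It assumes the overlap length is already known and that read 2 has already been reverse-complemented into read 1's orientation; real merging tools have to find the overlap themselves and use more careful quality models. `merge_pair` and the example reads are hypothetical.

```python
def merge_pair(seq1, qual1, seq2, qual2, overlap):
    """Merge two mates into a single read.

    Assumptions (illustrative only): `seq2`/`qual2` are already
    reverse-complemented into read 1's orientation, and the last
    `overlap` bases of read 1 align to the first `overlap` bases of
    read 2. Quality scores are lists of Phred integers.
    """
    if not 0 < overlap <= min(len(seq1), len(seq2)):
        raise ValueError("overlap must be between 1 and the shorter read length")

    head = len(seq1) - overlap
    merged_seq = list(seq1[:head])
    merged_qual = list(qual1[:head])

    # In the overlap, keep whichever call has the higher quality.
    for i in range(overlap):
        b1, q1 = seq1[head + i], qual1[head + i]
        b2, q2 = seq2[i], qual2[i]
        if q1 >= q2:
            merged_seq.append(b1)
            merged_qual.append(q1)
        else:
            merged_seq.append(b2)
            merged_qual.append(q2)

    merged_seq.extend(seq2[overlap:])
    merged_qual.extend(qual2[overlap:])
    return "".join(merged_seq), merged_qual


# Read 1 ends in "TAC", read 2 starts with "TTC": they disagree at the
# middle overlap position, where read 1 has low quality (10).
seq, qual = merge_pair("ACGTAC", [30, 30, 30, 30, 10, 30],
                       "TTCGGA", [30, 30, 35, 30, 30, 30], overlap=3)
print(seq)   # ACGTTCGGA
print(qual)  # [30, 30, 30, 30, 30, 35, 30, 30, 30]
```

The merged read has length 6 + 6 - 3 = 9, which is where the longer read for assembly comes from.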

      Merging is not necessary for mapping to call SNPs, though it is helpful if you are looking for long indels or mapping RNA-seq data, because indel/intron calls are more accurate toward the middle of a read, and longer (merged) reads have relatively more bases in the middle.



      • #4
        I understand what you two are saying, and it makes sense to me. I just want to do SNP and indel calling. If you have a good aligner that uses the overlapping PE information for higher-confidence calls, then using the unmerged PE reads as input is likely the most accurate thing to do. Thanks a lot for this insight.
