  • Lots of broken pairs

    We recently ran a single HiScan lane of shotgun DNA size-selected for a ~700-800 bp insert size (the user wanted large inserts to get around repeats as much as possible without resorting to mate-pair). Cluster density was lower than usual, but the lane still returned ~14 Gbp of sequence from a 2x101 bp run. However, the user has reported a very high level (~80%) of broken pairs, both when mapping reads to references and in de novo assembly. Has anyone here come across this issue and found a way around it?

    Some possible causes I can think of:

    1) Insert size: this is the largest we've tried, but I don't think it is the cause, as I've seen people here post results from 1.5 kbp MiSeq runs.

    2) Indexing: this was a single indexed sample (the only sample in its lane), but the rest of the run employed indexing, so the sequencer had trouble with this lane because bases were not present in both laser channels during the index read. Only 20% of reads were successfully assigned to the index. I merged the FASTQ files from the indexed folder and the unaligned folder for this lane to give to the user, with the caveat that he would need to filter out the PhiX reads from the unaligned portion.

    Any help appreciated.

  • #2
    Without more information about exactly what the end user was doing, it's hard to guess why they would have a lot of broken read pairs. I can't see how your library prep/sequencing methods would be at fault, as the two reads of each pair would represent the two ends of a single fragment.

    The most likely reason is that the end user is doing their processing incorrectly. The top causes would be that the insert size they gave their software is incorrect, their reference files are wrong, the read pairs aren't being kept together correctly during pre-processing, or they've got the read orientation wrong.

    My suggestion would be to have them submit a post here describing what they did so the community can make sure they're doing their data processing correctly.
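One of the causes listed above, read pairs not being maintained during pre-processing, can be checked directly on the FASTQ files. The following is a minimal sketch, not part of any tool mentioned in this thread: the function names are hypothetical, and it assumes plain 4-line FASTQ records with read 1 and read 2 in separate files (optionally gzipped) and Illumina-style headers (`@name/1` or `@name 1:N:0:...`).

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name of each 4-line FASTQ record, without the mate suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each record
                fields = line[1:].split()
                name = fields[0] if fields else ""
                # Older Illumina headers end in /1 or /2; strip that suffix.
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def first_pair_mismatch(r1_path, r2_path):
    """Return (record_index, name1, name2) for the first out-of-sync pair, or None."""
    pairs = zip_longest(read_names(r1_path), read_names(r2_path))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:  # also catches one file being shorter (name is None)
            return index, name1, name2
    return None
```

If this returns anything other than `None` after trimming or merging, the two files are no longer in the same order, and a mapper that pairs reads by position in the file would likely report many broken pairs.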



    • #3
      Thanks. They would be using CLC Genomics Workbench, but I'm not sure exactly which parameters they used. I'm checking the data now with my own copy, and I'll also check the reads where the index was successfully identified separately from the unaligned ones to see if there's any difference.



      • #4
        Hi there,

        I have a similar problem and would be grateful for any advice this awesome community has.
        I have a lane of paired-end (PE) Illumina data from a large plant genome, and my workflow was as follows:
        - import the raw reads as PE (forward-reverse orientation)
        - set the paired-end distance to 190-250
        - trim for quality and adapter contamination
        - assemble de novo with "auto detect paired distance" (redundant, since the insert size was already entered at import, but it does serve to confirm the insert size)

        Despite this I still end up with ~60% broken pairs.

        I cannot figure out why this would be the case. Is there another parameter I should be considering, or is this likely a reflection of my actual data? I have run a quality assessment on the raw reads, both with FastQC and within CLC bio itself, and nothing stood out to me: quality scores were consistently high, dropping towards the tail end of the reads.

        I have attached the summary report from the assembly for additional information.

        Thanks in advance!


        • #5
          Your assembly looks pretty bad, N50 of 375 bp and ~1.24 M contigs for a ~465 Mbp genome.

          The problem you're likely having is that your library prep was the wrong choice for your genome. I don't work with plants, but from the limited knowledge I've gleaned from working with those who do, plant genomes are highly repetitive and often polyploid. This means you have a lot of repetitive elements, which will kill your assembly and could be leading to lots of broken pairs in the mapping.

          From your post, your library has a very small pair distance, so any repetitive elements larger than, say, 500 bp won't be resolved, leading to the fragmentation. What will get built are the non-repetitive parts, and this is where you get broken pairs: one read will map to the non-repetitive region of a contig, while its mate will either map to multiple other contigs/positions or have no contig to map to at all.

          For de novo assembly of plant genomes, you really need large insert libraries such as mate pair or even better, PacBio. The TruSeq Long Reads kit would also work well for what you want.
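To make the reasoning above concrete, here is a rough sketch of how a pair might end up counted as broken once reads are mapped back to a fragmented assembly. The categories, the `Mate` type and the distance definition (outer distance from the start of the left read to the end of the right read) are assumptions for illustration only; they are not CLC's actual definition of a broken pair. The 190-250 range is the paired distance quoted in post #4.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Mate(NamedTuple):
    contig: Optional[str]  # None if the read did not map anywhere
    start: int             # 0-based leftmost mapped position
    end: int               # exclusive end of the mapped read
    reverse: bool          # True if mapped to the reverse strand


def classify_pair(m1, m2, min_dist, max_dist):
    """Label a pair as 'intact' or with the reason it would count as broken."""
    if m1.contig is None or m2.contig is None:
        return "unmapped_mate"
    if m1.contig != m2.contig:
        return "different_contigs"
    left, right = (m1, m2) if m1.start <= m2.start else (m2, m1)
    # Forward-reverse libraries: left mate on forward strand, right mate on reverse.
    if left.reverse or not right.reverse:
        return "wrong_orientation"
    distance = right.end - left.start  # assumed outer distance
    if not (min_dist <= distance <= max_dist):
        return "distance_out_of_range"
    return "intact"


def summarize(pairs, min_dist, max_dist):
    """Count how many pairs fall into each category."""
    return Counter(classify_pair(a, b, min_dist, max_dist) for a, b in pairs)


# Hypothetical 101 bp reads mapped to contigs of a fragmented assembly.
example_pairs = [
    (Mate("c1", 100, 201, False), Mate("c1", 220, 321, True)),   # both in unique sequence
    (Mate("c1", 100, 201, False), Mate("c7", 5, 106, True)),     # mate landed on another contig
    (Mate("c1", 100, 201, False), Mate(None, 0, 0, False)),      # mate has no contig to map to
    (Mate("c1", 100, 201, False), Mate("c1", 900, 1001, True)),  # mate placed in a distant repeat copy
]
counts = summarize(example_pairs, 190, 250)
broken_fraction = 1 - counts["intact"] / len(example_pairs)
```

With short contigs and many collapsed repeats, the second and third categories would be expected to dominate, which is consistent with a high broken-pair percentage even when the library and the pairing in the input files are fine.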
          Last edited by mcnelson.phd; 07-09-2014, 04:16 PM. Reason: Corrected.



          • #6
            Thank you for taking the time to reply to me, I really appreciate it. After playing with the stringency settings, multiple kmer sizes, and chatting with the people over at CLC, I have to conclude that you are correct in your assessment.

            It seems unlikely that I'll be able to resolve this with the current data set.

            Thanks again for your input!
