**mcnelson.phd** · 01-09-2013, 05:34 AM

Without having more information about what the end user was doing exactly it's hard to guess at why they would have a lot of broken read pairs. I can't see how your library prep/sequencing methods would be at fault as each read would represent either end of a single fragment.

The most likely reason is that the end user is doing their processing incorrectly. Top causes would be that the insert size they provided to their software is incorrect, their reference files are wrong, the read pairs aren't being maintained correctly during pre-processing, or that they've screwed up the directionality.
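One of those causes, read pairs falling out of sync during pre-processing, is easy to check directly. Here is a minimal sketch, assuming two plain-text FASTQ files with exactly four lines per record and mates listed in the same order; the function names and the toy reads are made up for illustration, and this is not how any particular tool does it.

```python
import io
from itertools import zip_longest

def read_names(handle):
    """Yield the read name from each 4-line FASTQ record in a text handle."""
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue  # only every fourth line is a header
        fields = line[1:].split()  # drop '@' and any comment after whitespace
        if not fields:
            continue  # tolerate a trailing blank line
        name = fields[0]
        # Older Illumina headers mark the mate with a /1 or /2 suffix.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name

def first_mismatch(r1_handle, r2_handle):
    """Return the 0-based index of the first out-of-sync pair, or None."""
    pairs = zip_longest(read_names(r1_handle), read_names(r2_handle))
    for idx, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return idx
    return None

# Toy data (hypothetical): read "b" was dropped from R2 during trimming.
r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nACGT\n+\nIIII\n@c/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@a/2\nTTTT\n+\nIIII\n@c/2\nTTTT\n+\nIIII\n")
print(first_mismatch(r1, r2))
```

If this reports a mismatch, every pair from that record onward is mispaired, which on its own could produce a large fraction of broken pairs.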

My suggestion would be to have them submit a post here describing what they did so the community can make sure they're doing their data processing correctly.

**matth431** · 01-09-2013, 07:31 AM

Thanks - they would be using CLC Genomics Workbench but not sure exactly what parameters they've used. Checking the data now using my own copy - also going to check the reads where the index was successfully ID'd and those unaligned separately to see if there's any difference.

**asteraceae** · 07-09-2014, 03:22 PM

Hi there,

I have a similar problem and would be grateful for any advice this awesome community has.
I have a lane of PE Illumina data from a large plant genome, and my workflow was as follows:
- input raw reads, choose PE (forward, reverse orientation)
- set the paired-end distance to 190-250
- trim for quality and adapter contamination
- assemble de novo with "auto detect paired distance" (redundant, since the insert size was also entered at input, but it does act to confirm the insert size)

Despite this I still end up with ~60% broken pairs.

I cannot figure out why this would be the case. Is there another parameter I should be considering, or is this likely a reflection of my actual data? I have performed a quality assessment on the raw reads using FastQC as well as within CLC bio itself, and nothing stood out to me; quality scores were consistently high, dropping towards the tail end of the reads.

I have attached the summary report from the assembly for additional information.

Thanks in advance!

Attached Files

C.monticola_denovo_assembly_complete_may2014_summaryreport.pdf (162.2 KB, 15 views)

**mcnelson.phd** · 07-09-2014, 03:33 PM

Your assembly looks pretty bad, N50 of 375 bp and ~1.24 M contigs for a ~465 Mbp genome.
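For anyone unfamiliar with the metric: N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases, so an N50 of 375 bp means the assembly is made up mostly of very short pieces. A minimal sketch of the calculation, with made-up contig lengths rather than anything from the attached report:

```python
def n50(lengths):
    """Return the largest length L such that contigs >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input

# Hypothetical contig lengths in bp, for illustration only.
contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contigs))
```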

The problem you're likely having is that your library prep was the wrong choice for your genome. I don't work with plants, but from my limited knowledge gleaned from working with those who do, plant genomes are highly repetitive and often polyploid. This means you have a lot of repetitive elements which will kill your assembly and could be leading to lots of broken pairs in the mapping.

From your post, your library has a very small pair distance, so any repetitive elements larger than, say, 500 bp won't be resolved, leading to the fragmentation. What will get built are the non-repetitive parts, and this is where you have broken pairs: one read will be able to map to the non-repetitive region of a contig, and its mate will either want to map to multiple other contigs/positions or have no contig to map to at all.
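As a rough back-of-the-envelope illustration of why pair distance limits what can be resolved, here is a deliberately simplified sketch. It assumes the paired distance is the full fragment length and that each read needs some minimum number of unique bases on its side of the repeat to be placed; the 30-base anchor and the 3000 bp fragment are hypothetical values, and real assemblers are more involved than this.

```python
def max_spannable_repeat(fragment_length, min_anchor):
    """Longest repeat a pair can bridge with min_anchor unique bases on each side."""
    return max(0, fragment_length - 2 * min_anchor)

def can_span(repeat_length, fragment_length, min_anchor=30):
    """True if a fragment of this length can bridge a repeat of the given length."""
    return repeat_length <= max_spannable_repeat(fragment_length, min_anchor)

# 190 and 250 are the paired distances from the post; 3000 is a hypothetical
# large-insert fragment included for comparison.
for fragment in (190, 250, 3000):
    print(fragment, max_spannable_repeat(fragment, 30), can_span(500, fragment))
```

Under these assumptions, a 500 bp repeat is out of reach for the 190-250 library but not for a much larger insert.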

For de novo assembly of plant genomes, you really need large insert libraries such as mate pair or even better, PacBio. The TruSeq Long Reads kit would also work well for what you want.

**asteraceae** · 08-18-2014, 11:26 AM

Thank you for taking the time to reply to me, I really appreciate it. After playing with the stringency settings and multiple k-mer sizes, and chatting with the people over at CLC, I have to conclude that you are correct in your assessment.

It seems unlikely that I'll be able to resolve this with the current data set.

Thanks again for your input!

Lots of broken pairs

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News