**Simon Anders** · 05-22-2013, 01:47 AM

I'm not sure whether I read this right, but it seems that _most_ of your reads appear much more than 10 times. Hence, once you remove all the duplicates, you will get down from 200M reads to maybe 10M unique ones, and this will surely be too few to assemble your genome. Also, a whopping 12% of the reads map to the adapter (and if I understand correctly, this means that you have been sequencing primer dimers rather than your genome).

So, your sequencing provider needs to come up with a better excuse than claiming that this would be "common".

**kopi-o** · 05-22-2013, 03:54 AM

It is common for mate pairs. You are *supposed* to get adapters (mate-pair linkers) due to the library prep. You need to pre-process the reads with something like http://genomes.sdsc.edu/downloads/deloxer/ before using them for assembly.

**kmcarr** · 05-22-2013, 04:51 AM

Originally posted by fahmida:

Is it common to have such high duplication level?
Do we need to discard duplicated reads?

Mate pair libraries are naturally very low diversity, and the larger the initial fragmentation, the lower the final library diversity. For an 8kbp library I am not terribly surprised by the duplication level you have observed. You have simply reached the saturation depth of this library. It is not common to sequence an entire HiSeq lane for one mate pair library, as you do not need deep coverage from your mate pairs; they are only needed to scaffold contigs built from your deep, paired-end coverage.

You should also be aware that FastQC is only considering one read of the pair in calculating the duplication rate. When you perform a proper duplicate analysis which considers both members of the read pair the duplication rate will drop.

Yes, you should remove duplicates. I normally use picard tools.
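To illustrate the point about FastQC looking at only one read of each pair, here is a minimal Python sketch on made-up toy data. It compares a duplication rate computed from read 1 alone with one computed from both mates together. The `duplication_rate` helper and the `pairs` list are hypothetical, and the comparison is by sequence only; Picard's MarkDuplicates works on alignment coordinates instead, so this is not how Picard computes its figure.

```python
from collections import Counter


def duplication_rate(keys):
    """Return the fraction of items that repeat an already-seen key."""
    keys = list(keys)
    if not keys:
        return 0.0
    counts = Counter(keys)
    # Every occurrence beyond the first of each key counts as a duplicate.
    duplicates = len(keys) - len(counts)
    return duplicates / len(keys)


# Toy read pairs as (read 1 sequence, read 2 sequence); illustrative only.
pairs = [
    ("ACGT", "TTGA"),
    ("ACGT", "TTGA"),  # true duplicate: both mates identical
    ("ACGT", "GGCA"),  # same read 1, different read 2
    ("ACGT", "CCAT"),  # same read 1, different read 2
    ("GGTA", "TTGA"),
]

# Rate when only read 1 is considered (the single-read view).
single_end_rate = duplication_rate(r1 for r1, _ in pairs)
# Rate when both mates must match for a pair to count as a duplicate.
paired_rate = duplication_rate(pairs)

print(f"read 1 only: {single_end_rate:.0%}")
print(f"both mates:  {paired_rate:.0%}")
```

In this toy set, three of the five pairs look like duplicates when only read 1 is inspected, but only one pair is a duplicate once read 2 is taken into account.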

Originally posted by Simon Anders:

Also, a whopping 12% of the reads map to the adapter...

Simon, FastQC reports the percentage of the contaminating sequence so it is 0.1164%, or 0.001164 as a fraction.
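As a quick sanity check on the scale, the conversion can be written out in a few lines of Python. The total of 200 million reads is an assumption taken from the rough "200M reads" figure mentioned earlier in the thread, not a number from the FastQC report itself.

```python
# Value as printed by FastQC: already a percentage, not a fraction.
reported_percent = 0.1164

# Assumed total, based on the approximate "200M reads" quoted in the thread.
total_reads = 200_000_000

fraction = reported_percent / 100        # 0.001164
adapter_reads = fraction * total_reads   # reads expected to match the adapter

print(f"fraction of reads: {fraction:.6f}")
print(f"approximate adapter reads: {adapter_reads:,.0f}")
```

Under that assumption this is on the order of a couple of hundred thousand reads, roughly a hundred times fewer than a 12% reading would imply.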

**Simon Anders** · 05-22-2013, 05:29 AM

Okay, then better ignore my post. Seems I know much less about mate-pair libraries than I thought. ;-)

**fahmida** · 05-22-2013, 10:24 PM

Thanks for your comments and suggestions, Simon, kopi-o and kmcarr. I am in the middle of running Picard's MarkDuplicates; hopefully it'll give a realistic estimate of the actual duplication level. Also, if possible, in our next HiSeq run I am planning to have 3kb and 5kb mate pairs in one lane.

P.S. Got the MarkDuplicates result, attached here.

Attached Files

duplication.metrics.txt (2.4 KB, 19 views)

**kmcarr** · 05-23-2013, 03:43 AM

Originally posted by fahmida:

Thanks for your comments and suggestions, Simon, kopi-o and kmcarr. I am in the middle of running Picard's MarkDuplicates; hopefully it'll give a realistic estimate of the actual duplication level. Also, if possible, in our next HiSeq run I am planning to have 3kb and 5kb mate pairs in one lane.

P.S. Got the MarkDuplicates result, attached here.

fahmida,

The stats you provided show only ~1% of the read pairs were mapped. Why so low?

**fahmida** · 05-23-2013, 05:20 AM

Originally posted by kmcarr:

fahmida,

The stats you provided show only ~1% of the read pairs were mapped. Why so low?

I am also puzzled by that and am trying to find an explanation! The mate-pair reads were mapped, using Bowtie's default parameters, to ~500,000 contigs generated from the first round of assembly (using 3 lanes of paired-end reads).

bowtie -t -S -p 20 --chunkmbs 50000 --un unaligned_8kbMatePair_reads.fastq 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq aln-pe.sam

Could it be due to the fragmented nature of the contigs, or to reads having only a partial match?

**Wallysb01** · 05-23-2013, 07:56 PM

Originally posted by fahmida:

I am also puzzled by that and am trying to find an explanation! The mate-pair reads were mapped, using Bowtie's default parameters, to ~500,000 contigs generated from the first round of assembly (using 3 lanes of paired-end reads).

bowtie -t -S -p 20 --chunkmbs 50000 --un unaligned_8kbMatePair_reads.fastq 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq aln-pe.sam

Could it be due to the fragmented nature of the contigs, or to reads having only a partial match?

Are your reads reverse-forward still, as is typical of mate-pair seqs? Should you add --rf as an option?

**fahmida** · 05-24-2013, 02:14 PM

Originally posted by Wallysb01:

Are your reads reverse-forward still, as is typical of mate-pair seqs? Should you add --rf as an option?

Thanks for pointing that out. I've repeated the alignment, this time with bowtie2 and the following parameters:

bowtie2 -t -p 20 -N 1 -I 4000 -X 9000 --rf --un unaligned_8kbMatePair_reads.fastq -x 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq -S bowtie2.aln.sam

And got the following output:

200340177 reads; of these:
200340177 (100.00%) were paired; of these:
194552274 (97.11%) aligned concordantly 0 times
5639453 (2.81%) aligned concordantly exactly 1 time
148450 (0.07%) aligned concordantly >1 times
----
194552274 pairs aligned concordantly 0 times; of these:
33711784 (17.33%) aligned discordantly 1 time
----
160840490 pairs aligned 0 times concordantly or discordantly; of these:
321680980 mates make up the pairs; of these:
133741497 (41.58%) aligned 0 times
82899936 (25.77%) aligned exactly 1 time
105039547 (32.65%) aligned >1 times
66.62% overall alignment rate
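
For anyone wondering how the overall rate relates to the very low concordant rate, the arithmetic can be reproduced from the counts above. This is a small Python sketch with variable names of my own choosing; it assumes the overall rate counts every aligned mate (two per aligned pair, plus mates that aligned individually) over all mates.

```python
# Counts copied from the bowtie2 summary above.
total_pairs = 200_340_177
concordant_once = 5_639_453
concordant_multi = 148_450
discordant_once = 33_711_784
mates_aligned_once = 82_899_936
mates_aligned_multi = 105_039_547

# Pairs that aligned as a pair, concordantly or discordantly.
pairs_aligned = concordant_once + concordant_multi + discordant_once

# Each aligned pair contributes two mates; the rest aligned as single mates.
mates_aligned = 2 * pairs_aligned + mates_aligned_once + mates_aligned_multi

overall_rate = mates_aligned / (2 * total_pairs)
concordant_rate = (concordant_once + concordant_multi) / total_pairs

print(f"overall alignment rate: {overall_rate:.2%}")
print(f"concordant pair rate:   {concordant_rate:.2%}")
```

The first figure matches the 66.62% reported by bowtie2, which shows that most of the aligned reads are mates that mapped on their own or discordantly rather than as proper pairs in the 4000 to 9000 bp window.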

**Wallysb01** · 05-24-2013, 03:24 PM

Hmm, I guess the discordant alignments are just the regular PE reads that come along as contamination with the mate pair prep. Was this also after you trimmed adapter sequences?


Unusually high duplicated Reads in Mate Pair Library
