  • fahmida
    Member
    • Aug 2010
    • 54

    Unusually high duplicated Reads in Mate Pair Library

    Hi,

    Recently we received one lane of HiSeq 8 kb mate-pair data with 200 million 100 bp reads. The data is intended for de novo assembly/scaffolding of a ~800 Mb genome. Initial FastQC assessment indicates good data quality, except for an UNUSUALLY high level of duplicated reads: 96.77%! Please find attached the relevant FastQC images.

    Searching for relevant posts turned up other reported duplication levels as high as 80-85%, which could be due to PCR bias. The sequencing service provider assured us that this level of duplication is common for Illumina mate-pair libraries. When we combined this data with our existing data (Illumina: 2 lanes of 400 bp PE + 1 lane of 700 bp PE), we got either worse results than before (using CLC) or just minimal improvements (using SOAPdenovo) in terms of N50, number of contigs/scaffolds, etc.
    Now we wonder:
    Is it common to have such a high duplication level?
    Do we need to discard duplicated reads? If yes, what are the best tools (rmdup? Picard?)
    And finally, what strategy would improve the assembly with the data we have?

    Thanks for your advice.

    Cheers.
    Last edited by fahmida; 05-21-2013, 05:14 PM. Reason: wrong title
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    I'm not sure whether I read this right, but it seems that _most_ of your reads appear much more than 10 times. Hence, once you remove all the duplicates, you will get down from 200M reads to maybe 10M unique ones, and this will surely be too few to assemble your genome. Also, a whopping 12% of the reads map to the adapter (and if I understand correctly, this means that you have been sequencing primer dimers rather than your genome).

    So, your sequencing provider needs to come up with a better excuse than claiming that this would be "common".


    • kopi-o
      Senior Member
      • Feb 2008
      • 319

      #3
      It is common for mate pairs. You are *supposed* to get adapters (mate-pair linkers) due to the library prep. You need to pre-process the reads with something like http://genomes.sdsc.edu/downloads/deloxer/ before using them for assembly.


      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Originally posted by fahmida
        Is it common to have such a high duplication level?
        Do we need to discard duplicated reads?
        Mate-pair libraries are naturally very low diversity, and the larger the initial fragments, the lower the final library diversity. For an 8 kbp library I am not terribly surprised by the duplication level you have observed. You have simply reached the saturation depth of this library. It is not common to sequence an entire HiSeq lane for one mate-pair library, as you do not need deep coverage from your mate pairs; they are only needed to scaffold contigs built from your deep paired-end coverage.
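
        To put rough numbers on why deep mate-pair coverage is not needed, here is a minimal sketch of the arithmetic. It assumes the figures from the first post (200 million 100 bp reads, an 8 kb insert, a ~800 Mb genome) and that "200 million" counts individual reads, i.e. 100 million pairs; if it counts pairs, the physical coverage doubles. The function names are made up for illustration.

```python
def sequence_coverage(n_reads, read_len, genome_size):
    """Average number of sequenced bases covering each genome position."""
    return n_reads * read_len / genome_size


def physical_coverage(n_pairs, insert_size, genome_size):
    """Average number of mate-pair inserts spanning each genome position."""
    return n_pairs * insert_size / genome_size


# Figures taken from the first post; the pair count is an assumption.
N_READS = 200_000_000      # individual reads (assumed, not pairs)
READ_LEN = 100             # bp
INSERT_SIZE = 8_000        # bp, nominal mate-pair insert
GENOME_SIZE = 800_000_000  # bp

seq_cov = sequence_coverage(N_READS, READ_LEN, GENOME_SIZE)
phys_cov = physical_coverage(N_READS // 2, INSERT_SIZE, GENOME_SIZE)

print(f"sequence coverage: {seq_cov:.0f}x")   # 25x
print(f"physical coverage: {phys_cov:.0f}x")  # 1000x
```

        Under these assumptions, even 1% of the lane would still span each position about ten times on average, and spanning coverage is the quantity that matters for scaffolding.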

        You should also be aware that FastQC is only considering one read of the pair in calculating the duplication rate. When you perform a proper duplicate analysis which considers both members of the read pair the duplication rate will drop.
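
        A toy illustration of why the pair-aware rate can only be equal or lower: two pairs that share read 1 but differ in read 2 count as duplicates when only read 1 is inspected, but not when the pair is the unit. This is a sketch on made-up sequences; it is not how FastQC or Picard work internally (Picard, for one, compares alignment coordinates rather than sequence strings).

```python
def duplication_rate(keys):
    """Fraction of observations that repeat an already-seen key."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1 - len(set(keys)) / len(keys)


# Hypothetical (read 1, read 2) sequences for four pairs.
pairs = [
    ("ACGT", "TTGA"),
    ("ACGT", "CCAT"),  # same read 1 as the first pair, different read 2
    ("ACGT", "TTGA"),  # true duplicate of the first pair
    ("GGCA", "TTGA"),
]

read1_only = duplication_rate(r1 for r1, _ in pairs)  # 2 unique of 4
both_reads = duplication_rate(pairs)                  # 3 unique of 4

print(f"read 1 only: {read1_only:.0%}")  # 50%
print(f"both reads:  {both_reads:.0%}")  # 25%
```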

        Yes, you should remove duplicates. I normally use picard tools.

        Originally posted by Simon Anders
        Also, a whopping 12% of the reads map to the adapter...
        Simon, FastQC reports the percentage of the contaminating sequence so it is 0.1164%, or 0.001164 as a fraction.
        Last edited by kmcarr; 05-22-2013, 04:56 AM. Reason: Added comment about paired end duplicates.


        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          Okay, then better ignore my post. Seems I know much less about mate-pair libraries than I thought. ;-)


          • fahmida
            Member
            • Aug 2010
            • 54

            #6
            Thanks for your comments and suggestions, Simon, kopi-o and kmcarr. I am in the middle of running Picard's MarkDuplicates; hopefully it will give a realistic estimate of the actual duplication level. Also, if possible, in our next HiSeq run I am planning to have 3 kb and 5 kb mate pairs in one lane.

            P.S. Got the MarkDuplicates result, attached here.

            • kmcarr
              Senior Member
              • May 2008
              • 1181

              #7
              Originally posted by fahmida
              Thanks for your comments and suggestions, Simon, kopi-o and kmcarr. I am in the middle of running Picard's MarkDuplicates; hopefully it will give a realistic estimate of the actual duplication level. Also, if possible, in our next HiSeq run I am planning to have 3 kb and 5 kb mate pairs in one lane.

              P.S. Got the MarkDuplicates result, attached here.
              fahmida,

              The stats you provided show only ~1% of the read pairs were mapped. Why so low?


              • fahmida
                Member
                • Aug 2010
                • 54

                #8
                Originally posted by kmcarr
                fahmida,

                The stats you provided show only ~1% of the read pairs were mapped. Why so low?
                I am also puzzled by that and am trying to find an explanation! Using Bowtie's default parameters, the mate-pair reads were mapped to ~500,000 contigs generated from the first round of assembly (using the 3 paired-end lanes).

                bowtie -t -S -p 20 --chunkmbs 50000 --un unaligned_8kbMatePair_reads.fastq 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq aln-pe.sam

                Could it be due to the fragmented nature of the contigs, or to reads having only partial matches?
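
                One way to gauge the first possibility (fragmented contigs) is to ask how often an 8 kb fragment can fit inside a single contig at all. The sketch below assumes the ~500,000 contigs together span roughly the whole ~800 Mb genome, which the thread does not state, and uses a made-up toy length distribution because the real one is not given. The function name is hypothetical.

```python
def fraction_fitting(contig_lengths, insert_size):
    """Fraction of fragment start positions at which a fragment of
    insert_size bp lies entirely within a single contig."""
    total = sum(contig_lengths)
    if total == 0:
        return 0.0
    fitting = sum(max(0, length - insert_size + 1) for length in contig_lengths)
    return fitting / total


# Mean contig length if 500,000 contigs cover an 800 Mb genome (assumption).
mean_contig = 800_000_000 / 500_000
print(f"mean contig length: {mean_contig:.0f} bp")  # 1600 bp

# Toy distribution: nine mean-sized contigs and one 20 kb contig.
toy_contigs = [1_600] * 9 + [20_000]
frac = fraction_fitting(toy_contigs, 8_000)
print(f"start positions where both mates share a contig: {frac:.1%}")
```

                With a mean contig length far below the insert size, most pairs cannot have both mates on the same contig, so few pairs can align as a proper pair. Note also that, as far as I know, Bowtie 1 defaults to forward-reverse orientation and a maximum insert size of 250 bp (-X), so pairs from an 8 kb reverse-forward library would rarely be reported as aligned under default parameters, whatever the contig lengths.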


                • Wallysb01
                  Senior Member
                  • Feb 2011
                  • 286

                  #9
                  Originally posted by fahmida
                  I am also puzzled by that and am trying to find an explanation! Using Bowtie's default parameters, the mate-pair reads were mapped to ~500,000 contigs generated from the first round of assembly (using the 3 paired-end lanes).

                  bowtie -t -S -p 20 --chunkmbs 50000 --un unaligned_8kbMatePair_reads.fastq 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq aln-pe.sam

                  Could it be due to the fragmented nature of the contigs, or to reads having only partial matches?
                  Are your reads reverse-forward still, as is typical of mate-pair seqs? Should you add --rf as an option?


                  • fahmida
                    Member
                    • Aug 2010
                    • 54

                    #10
                    Originally posted by Wallysb01
                    Are your reads reverse-forward still, as is typical of mate-pair seqs? Should you add --rf as an option?
                    Thanks for pointing that out. I've repeated the alignment, this time with bowtie2 and the following parameters:
                    bowtie2 -t -p 20 -N 1 -I 4000 -X 9000 --rf --un unaligned_8kbMatePair_reads.fastq -x 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq -S bowtie2.aln.sam

                    And got the following output:

                    200340177 reads; of these:
                      200340177 (100.00%) were paired; of these:
                        194552274 (97.11%) aligned concordantly 0 times
                        5639453 (2.81%) aligned concordantly exactly 1 time
                        148450 (0.07%) aligned concordantly >1 times
                        ----
                        194552274 pairs aligned concordantly 0 times; of these:
                          33711784 (17.33%) aligned discordantly 1 time
                        ----
                        160840490 pairs aligned 0 times concordantly or discordantly; of these:
                          321680980 mates make up the pairs; of these:
                            133741497 (41.58%) aligned 0 times
                            82899936 (25.77%) aligned exactly 1 time
                            105039547 (32.65%) aligned >1 times
                    66.62% overall alignment rate
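
                    For reading this summary, it may help to see how the overall rate relates to the individual lines. The sketch below recomputes it from the counts printed above, on the understanding that the overall rate counts aligned mates (two per pair aligned as a pair, plus mates aligned on their own) over all mates. The variable names are made up; the numbers are copied from the output.

```python
# Counts copied from the bowtie2 summary above.
pairs_total = 200_340_177
concordant_once = 5_639_453
concordant_multi = 148_450
discordant_once = 33_711_784
mates_once = 82_899_936      # mates aligned singly, exactly 1 time
mates_multi = 105_039_547    # mates aligned singly, >1 times

# Pairs aligned as a pair, concordantly or discordantly.
pairs_aligned = concordant_once + concordant_multi + discordant_once

# Each aligned pair contributes two mates; single mates count once.
mates_aligned = 2 * pairs_aligned + mates_once + mates_multi
overall_rate = mates_aligned / (2 * pairs_total)
pair_rate = pairs_aligned / pairs_total

print(f"overall alignment rate: {overall_rate:.2%}")  # 66.62%
print(f"pairs aligned as a pair: {pair_rate:.2%}")
```

                    So the 66.62% figure is dominated by mates that aligned individually; fewer than one pair in five aligned as a pair, and under 3% did so concordantly.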


                    • Wallysb01
                      Senior Member
                      • Feb 2011
                      • 286

                      #11
                      Hmm, I guess the discordant alignments are just the regular PE reads that come along as contamination with mate-pair prep. Was this also after you trimmed adapter sequences?
