  • Duplication levels

    Hello all,
    I'm currently analyzing single-end, 50 bp RNA-seq data that was sequenced at an outside facility. I have a very naive question, since I'm relatively new to all this.

    The facility provided me with what they call raw reads, containing sequencing adaptors etc. In addition to that, I also have the pre-processed "clean" reads. The details of the "cleaning", as they informed me, are as follows:
    1. Remove reads with adaptor sequences.
    2. Remove reads in which the percentage of unknown bases (N) is greater than 10%.
    3. Remove low quality reads. If the percentage of low quality bases (bases with quality value ≤ 5) is greater than 50% in a read, we define this read as low quality.
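    The three rules above could be sketched roughly as follows. This is only an illustration of the stated thresholds, not the facility's actual pipeline: the function name `keep_read`, its parameters, and the exact-substring adapter check are all assumptions.

```python
def keep_read(seq, quals, adapter=None, max_n_frac=0.10,
              low_q=5, max_low_q_frac=0.50):
    """Return True if a read passes the three filters described above.

    seq     -- read bases as a string
    quals   -- integer Phred quality scores, one per base
    adapter -- hypothetical adapter string; a simple exact substring
               match is assumed here, real tools allow mismatches
    """
    if not seq or len(seq) != len(quals):
        return False
    # Rule 1: drop reads that contain the adapter sequence.
    if adapter and adapter in seq:
        return False
    # Rule 2: drop reads with more than 10% unknown bases (N).
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # Rule 3: drop reads where more than 50% of bases have quality <= 5.
    low_quality_bases = sum(1 for q in quals if q <= low_q)
    if low_quality_bases / len(quals) > max_low_q_frac:
        return False
    return True
```

    Note that none of these rules creates a new read; they can only remove reads, which is relevant to the duplication question below.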
    I've already used these for alignment and other downstream analyses, but to be sure I went ahead and quality-checked the "clean" FASTQ files with FastQC. It flags the sequence duplication levels as high (roughly >66% on average across my samples).

    I think this is because the "cleaning" process enriches the FASTQ files for higher-quality data, but could it be due to an error during the library preparation step, or something else? Does it even make sense to QC these processed FASTQ files?
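    For reference, a duplication level can be approximated by counting exact-match sequences in the FASTQ file. The sketch below is a simplified stand-in, not FastQC's own algorithm (FastQC works from a subset of the file, so its number may differ). It assumes plain 4-line FASTQ records; the function names are made up for illustration.

```python
from collections import Counter


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def duplication_percent(sequences):
    """Percent of reads that are extra copies of an already seen sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence contributes one "original"; the rest are duplicates.
    return 100.0 * (total - len(counts)) / total
```

    Usage would look like `with open("sample.fastq") as fh: print(duplication_percent(read_sequences(fh)))`, where `sample.fastq` is a placeholder file name.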

    Ege

  • #2
    I don't see how the trimming would affect the duplicate levels.
    High duplicate levels are either due to PCR overamplification, or a low complexity library.

    Without more information, it is not possible to tell if the high duplicate levels are due to PCR over amplification, and therefore a problem, or due to a low complexity library, and are therefore representative of the library.

    If the amount of starting RNA was low and/or the number of PCR cycles was high, one would suspect PCR over amplification.
    If, when examining the alignment peaks, one sees isolated sequences duplicated many times, one would also suspect PCR over amplification.

    It can be tricky to distinguish whether high duplicate levels are due to PCR over amplification or a low complexity starting library. The researcher may not always be expecting a low complexity library. For example, I had an RNA-Seq sample of a cytoplasmic fraction with a high duplication rate. The library had been prepared using ribosomal depletion, which had not removed an RNA signalling molecule present in very high numbers in the cytoplasm.

    Sometimes, you need to really understand your samples to identify the cause of the high duplicate levels.

    • #3
      If coverage is high enough, you will have duplicates. Consider a 5 MB genome. At most you could have 100,000 unique 50bp reads; any more must be duplicates.

      RNA-seq data often has some genes that have super-high expression levels; if a gene has 1000x coverage, with 50bp reads, then at least 95% of them must be duplicates, because unique reads can only reach 50x coverage. I think FastQC's warning is based on the assumption that you have DNA data; I would ignore it.
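      The arithmetic behind the 95% figure can be written out as a small sketch. It assumes single-end reads of one fixed length, that two reads are duplicates when they start at the same position, and that coverage is given as an average depth; the function name is hypothetical.

```python
def min_duplicate_fraction(coverage, read_length):
    """Lower bound on the fraction of reads that must be duplicates.

    With reads of length X, a base can be covered by at most X distinct
    start positions, so unique reads give at most X-fold coverage.
    """
    if coverage <= 0 or read_length <= 0:
        raise ValueError("coverage and read_length must be positive")
    max_unique_coverage = min(coverage, read_length)
    return (coverage - max_unique_coverage) / coverage


print(min_duplicate_fraction(1000, 50))  # 0.95
print(min_duplicate_fraction(30, 50))    # 0.0, no duplicates are forced
```

      Real data will sit above this bound, since reads do not spread perfectly evenly over start positions.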

      Duplicates often come from over-amplification with PCR, too, but generally it's possible to determine the cause of the duplicates, if you know what to look for. Mapping and looking at the mapped reads in IGV can help. High levels of PCR duplicates will have a distinctive patchy coverage. Normally people don't remove duplicates from RNA-seq data because that interferes with quantification, so if the duplicates are indeed from amplification, you should either ignore them, or redo the experiment with more RNA and less amplification if they are actually a problem.

      The cleaning process sounds OK to me, but normally I recommend adapter trimming rather than adapter filtering, because you lose less data. The cleaning would tend to increase the percent of duplicate reads by removing reads with errors, but it's not like it adds any new duplicates, so that doesn't really matter.

      • #4
        Originally posted by Brian Bushnell
        If coverage is high enough, you will have duplicates. Consider a 5 MB genome. At most you could have 100,000 unique 50bp reads; any more must be duplicates.
        For a 5 Mbp genome you can have 5,000,000 unique 50bp reads (or 100bp reads, or 123bp reads, etc.). A read starting at base n is distinct from a read starting at base n+1 (e.g. 1-50 vs. 2-51). This assumes the genome is circular. If it is linear, then the number of potential unique 50bp reads is 4,999,951 (genome length minus read length, plus one).
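        The counting argument can be checked with a short sketch. It counts distinct start positions on a single strand only (counting both strands would double the numbers), and the function name is made up for illustration. For a linear sequence of length L and reads of length k, the last possible start is at base L - k + 1.

```python
def unique_read_positions(genome_length, read_length, circular=False):
    """Number of distinct start positions for reads on one strand."""
    if read_length <= 0 or read_length > genome_length:
        raise ValueError("read_length must be between 1 and genome_length")
    if circular:
        # Every base can start a read, because reads wrap around the origin.
        return genome_length
    # Linear: the last read must still fit before the end of the sequence.
    return genome_length - read_length + 1


print(unique_read_positions(5_000_000, 50, circular=True))  # 5000000
print(unique_read_positions(5_000_000, 50))                 # 4999951
```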
        Last edited by kmcarr; 05-23-2014, 04:26 AM.

        • #5
          Originally posted by kmcarr
          For a 5 Mbp genome you can have 5,000,000 unique 50bp reads (or 100bp reads, or 123bp reads, etc.). A read starting at base n is distinct from a read starting at base n+1 (e.g. 1-50 vs. 2-51). This assumes the genome is circular. If it is linear, then the number of potential unique 50bp reads is 4,999,951 (genome length minus read length, plus one).
          Woops, my math was totally wrong, that's correct =) For Xbp reads you can have at most X-fold unique coverage.
