  • Source of duplication in Illumina HiSeq paired-end reads?

    I recently received my first short-read data set, one lane of 2x100 bp Illumina HiSeq reads. I'm hoping the community can help me identify the source of the duplicate sequences indicated in the FastQC reports on the data.

    FastQC showed high duplication (>60%) for both forward and reverse reads. The report for the reverse reads did not turn up any specific overrepresented sequences, while the report for the forward reads identified a PCR primer and adapter sequence. However, a Bowtie alignment against the Illumina paired-end adapters and primers showed 0% alignment, and when I tried to use Picard to mark and remove duplicate reads, no reads were removed (Picard command below).

    My reads are from a single lane of 12 individually barcoded cDNA sub-libraries from a non-model organism (no reference genome). Six of these libraries were normalized (via DSN digestion); six were not. Has anyone seen similar FastQC curves for RNA-Seq data?

    Is there a way to search the file for the actual sequences that are highly duplicated?

    Full fastQC reports are attached.

    [command run as below, though I have omitted the path for each file]
    nohup java -jar MarkDuplicates.jar INPUT=sequence_file.bam OUTPUT=deduplicated_reads.bam METRICS_FILE=deduplicated_reads_metrics.txt REMOVE_DUPLICATES=true &


  • #2
    I usually see 60-80 % duplication levels in non-normalized RNA-Seq samples. Did you de-multiplex the sequences to see whether there are differences in duplication levels between the normalized and non-normalized libraries? I'd suspect that the normalization didn't work out very well.
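To make that comparison concrete, here is a minimal sketch of a per-library duplication estimate. It assumes the reads have already been de-multiplexed into one FASTQ file per barcode and that records are the usual four lines each; the function names and the 50-base prefix length are illustrative choices, not anything taken from FastQC itself. FastQC's own estimate is computed somewhat differently, so the numbers will not match its report exactly.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def duplication_fraction(sequences, prefix_len=50):
    """Return the fraction of reads whose first prefix_len bases were already seen.

    This is 1 - (distinct prefixes / total reads), a rough stand-in for a
    duplication level; prefix_len=50 is an assumption made for this sketch.
    """
    counts = Counter(seq[:prefix_len] for seq in sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


# Tiny in-memory example: 4 reads, 3 distinct sequences.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
    "@r4", "GGGG", "+", "IIII",
]
print(duplication_fraction(read_fastq_sequences(example)))  # 0.25
```

On real data you would pass an open file handle for each de-multiplexed library to `read_fastq_sequences` and compare the normalized libraries against the non-normalized ones. Counting every distinct prefix holds them all in memory, so a subsample of each file is probably enough for a first look.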

    FastQC looks at the first 50 bases of each read for overrepresentation, but as you pointed out yourself, adapter sequence was only flagged in the forward reads. You can remove it with e.g. Trimmomatic.

    I'm not sure about Picard, but samtools rmdup only works on mapped reads...

    Did you check the rRNA contamination levels already?


    • #3
      You could run a k-mer counter on the data to check for overrepresentation, e.g. Meryl or Jellyfish...
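For a quick look before installing a dedicated counter, the idea can be sketched in a few lines of Python. This is only an illustration of k-mer counting on a subsample of reads, not a replacement for Meryl or Jellyfish, which are built to handle a full lane; the function names and the default k of 25 are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequences, k=25):
    """Count every k-mer in the given read sequences, skipping k-mers with N."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def top_kmers(sequences, k=25, n=10):
    """Return the n most frequent k-mers as (kmer, count) pairs."""
    return count_kmers(sequences, k).most_common(n)


# Toy example with a short k so the result is easy to check by hand.
print(top_kmers(["ACGTACGT"], k=4, n=1))  # [('ACGT', 2)]
```

The most frequent k-mers can then be searched against rRNA, adapter, or primer sequences to see what is driving the overrepresentation. Memory grows with the number of distinct k-mers, so this only makes sense on something like the first few hundred thousand reads.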


      • #4
        High duplication levels in RNA-Seq are not necessarily a problem. Duplication simply means that you're getting very high fold coverage. For RNA-Seq it's quite normal to oversequence highly expressed transcripts in order to be able to see lowly expressed transcripts. Duplication warnings are more of a concern when they occur in libraries where you're expecting more equal coverage. 60% also isn't very high - a badly PCR duplicated library might have duplication levels above 90% (our personal record is 98%!). For more details of how to interpret this plot you can look at this blog post.
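The coverage argument can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions: a read is identified only by its transcript and start position, every transcript offers the same number of start positions, and there is no PCR duplication at all. The function name, the weights, and the read counts are made up for the example.

```python
import random


def simulate_duplication(expression_weights, n_reads, n_start_positions=1000, seed=0):
    """Sample reads from transcripts in proportion to expression and return
    the fraction of reads that repeat an already-seen (transcript, start) pair."""
    if n_reads == 0:
        return 0.0
    rng = random.Random(seed)
    transcripts = range(len(expression_weights))
    chosen = rng.choices(transcripts, weights=expression_weights, k=n_reads)
    seen = set()
    for t in chosen:
        seen.add((t, rng.randrange(n_start_positions)))
    return 1.0 - len(seen) / n_reads


# 100 transcripts, 10,000 reads, no PCR duplicates in either case.
uniform = simulate_duplication([1] * 100, n_reads=10_000)
skewed = simulate_duplication([900] + [1] * 99, n_reads=10_000)  # one transcript takes ~90% of reads
print(f"uniform expression: {uniform:.2f}")
print(f"skewed expression:  {skewed:.2f}")
```

With uniform expression the reads are spread thinly and few collide, while the skewed case piles most reads onto one transcript's limited set of start positions and the duplication fraction comes out several-fold higher, even though every read is an independent fragment.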


        • #5
          I agree.

          See this thread: http://seqanswers.com/forums/showthr...ght=duplicates

          Originally posted by simonandrews:
          High duplication levels in RNA-Seq are not necessarily a problem. Duplication simply means that you're getting very high fold coverage. For RNA-Seq it's quite normal to oversequence highly expressed transcripts in order to be able to see lowly expressed transcripts. Duplication warnings are more of a concern when they occur in libraries where you're expecting more equal coverage. 60% also isn't very high - a badly PCR duplicated library might have duplication levels above 90% (our personal record is 98%!). For more details of how to interpret this plot you can look at this blog post.
