
  • Sequence Duplication Levels failure

    Hi,

    Good [morning | afternoon | evening | night]

    I used FastQC to check the quality of my data. At first, four modules failed (Per base sequence content, Per base GC content, Per sequence GC content and Sequence duplication levels). I noticed that most of the errors came from the first 9 bases, so I trimmed them with Trimmomatic. After that, two modules still fail (Per sequence GC content and Sequence duplication levels).

    For Per sequence GC content, the GC content is higher than expected.

    For Sequence duplication levels, the graph rises after a duplication level of 9.

    (1) What should I do about them? Are they due to contamination?

    By the way, my "Sequence duplication levels" plot has only one red line and no blue line. (2) Why is that? Is it related to the version? My FastQC is version v0.10.1.

    I have attached both results in a PDF file.

    (3) I know Trimmomatic removes noise, but how much can I trim my sequences without affecting my downstream analysis? (Of course I could cut a 90 bp read down to 20 bp, but that would not be reliable for further analysis, for example for measuring differential gene expression with Cufflinks.) So what is the limit for trimming?

    I am sorry for asking so many questions.

    Thank you in advance for helping me.

  • #2
    FastQC frequently worries people when there's no need to worry, and doesn't always point out the things that are most important. I've got a few questions:
    • Are these RNA reads?
    • What is the expected GC fraction of your target genome?
    • How much DNA was present in the sample?
    • Have spike-ins (e.g. ERCC, lambda) been used?
    • What are the overrepresented sequences?


    In a best-case scenario, the double peak in the GC graph and the over-represented sequences could be explained by a spike-in taking up a large proportion of the reads, which would happen if the DNA hadn't been accurately quantified. Alternatively, a targeted sequencing of multiple genes might produce a similar effect.
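
    For anyone who wants to look at the per-read GC distribution outside FastQC, here is a minimal Python sketch. The function names are my own and the FASTQ reader assumes plain, uncompressed 4-line records; it is not FastQC's implementation, only an illustration of how a second population of reads (a spike-in or a contaminant) would show up as a second peak in the histogram.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of one read, ignoring uncalled (N) bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads):
    """Count reads per rounded GC percentage (keys fall in 0..100)."""
    return Counter(round(gc_percent(read)) for read in reads)


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a 4-line FASTQ file (assumed layout)."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()
```

    Calling gc_histogram(read_fastq_sequences("reads.fastq")) on a hypothetical file and plotting the counts should give a single bell-shaped curve for a clean library; two clear peaks suggest two sources of sequence with different GC content.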



    • #3
      These are cDNA reads (made from RNA).
      I don't know the expected GC fraction of the target genome (the data belongs to someone else; my job is to analyze and improve it).
      No spike-ins were used.
      There are three overrepresented sequences:
      1. CGCTCGCCGCTACTACGGGAATCGCTTTTGCTTTCTTTTCCTCTGGCTAC
      2. GATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAATGC
      3. TGGATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAAT



      • #4
        Well, a BLAST of all those sequences returns 100% identity matches to chloroplast genomes (probably rice).

        My guess is that what you're seeing here is cDNA reads that haven't been properly depleted for high-abundance transcripts, so there is a large amount of contaminant sequences in the data. My ball-park assumption from looking at the GC graph would be that there is about 30% chloroplast sequence in there.

        If at all possible, I'd recommend that your collaborator re-sequences these samples with a RiboZero depletion step included in the library preparation.


        Otherwise, map the reads only to the chloroplast sequence of the target (e.g. Oryza sativa) and exclude the reads that align (e.g. HISAT2 has "--un-conc" and "--un" options for doing precisely that), then re-run FastQC to see if it changes things. Even with that 30% contamination (assuming it's expected), you should still get reasonable results.
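
        As a rough illustration of that filtering step, here is a small Python sketch that only assembles the HISAT2 command lines. All file names and the index prefix are hypothetical placeholders, and the helper functions are my own; the flags used (-x, -U, -1, -2, --un, --un-conc, -S) are standard HISAT2 options, but check them against the manual for your installed version before running anything.

```python
import shlex


def hisat2_build_cmd(chloroplast_fasta, index_prefix):
    """Command to index the chloroplast reference (paths are placeholders)."""
    return ["hisat2-build", chloroplast_fasta, index_prefix]


def hisat2_filter_cmd(index_prefix, out_prefix, reads1, reads2=None):
    """Command that maps reads to the chloroplast index and keeps the non-matching reads.

    Single-end: --un writes reads that failed to align.
    Paired-end: --un-conc writes pairs that failed to align concordantly;
    the % in the path is replaced by 1 or 2 for the two mate files.
    """
    cmd = ["hisat2", "-x", index_prefix]
    if reads2 is None:
        cmd += ["-U", reads1, "--un", out_prefix + ".fastq"]
    else:
        cmd += ["-1", reads1, "-2", reads2, "--un-conc", out_prefix + ".%.fastq"]
    # The alignments to the chloroplast themselves go to a SAM file.
    cmd += ["-S", out_prefix + ".chloroplast.sam"]
    return cmd


if __name__ == "__main__":
    # Print the commands for inspection; run them with subprocess.run(cmd, check=True).
    print(shlex.join(hisat2_build_cmd("chloroplast.fa", "chloroplast_idx")))
    print(shlex.join(hisat2_filter_cmd("chloroplast_idx", "no_chloroplast",
                                       "reads_1.fastq", "reads_2.fastq")))
```

        The files written by --un or --un-conc are the reads that did not match the chloroplast, so those are the ones to pass back to FastQC and on to the rest of the analysis.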



        • #5
          Your answer surprised me. Yes, the data is from rice (Oryza sativa). The way you found the source of the contamination impressed me. Smart answers!

          So now I should find the rice chloroplast sequence and then exclude the matching reads. But I don't know how to do that with HISAT as you mentioned. I have to learn it first.

          Thank you~Thank you~Thank you



          • #6
            Originally posted by Saeideh:
            "And the way you found the source of contamination made me excited."
            Yes, BLAST is very useful. I'm glad that NCBI still provides a service for "where is this sequence from", despite all the newer locally-faster search tools that are available.

            "I don't know how to do it with HISAT as you mentioned. I have to learn it first."
            Learning HISAT2 would be a good idea, as it's the latest in a new generation of ultra-fast mappers, and has almost identical command-line parameters to Bowtie2. Another option would be STAR, which has a really great manual and might be easier to pick up and use as a naive high-throughput sequencing bioinformatician.
