  • [FastQC] Strange Per Sequence GC Content

    Hi,

    I have Illumina DNA genome re-sequencing data. All the modules in the FastQC report look satisfactory except "Per sequence GC content": there is a minor peak close to the main peak (please see the attached figure).

    All the adapter sequences and low quality reads have already been removed, so I don't think the extra peak is caused by these sequences.

    I would appreciate any ideas about what causes the odd shape of the peak and what I should do to correct it.

    Thanks in advance!

  • #2
    It's always a good idea to provide as much information as possible, for example, what organism this is. Some organisms (like fungi) often have at least one non-primary peak.

    But I encourage you to BLAST a few thousand reads against NR/NT/RefSeqMicrobial to see if you have contamination, which is a common cause of multiple peaks.
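One way to pull "a few thousand reads" for BLAST is to subsample the FASTQ file and write the subset as FASTA. The sketch below is only an illustration and not part of the original post: it assumes plain 4-line FASTQ records, and the helper names (`read_fastq`, `reservoir_sample`, `to_fasta`) are hypothetical. It uses reservoir sampling so the whole file never has to be held in memory.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read_id, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (read_id, sequence) pairs, assuming strict 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line (ignored)
        next(it, None)  # quality line (ignored, BLAST only needs bases)
        header = header.strip()
        if header.startswith("@") and seq:
            yield header[1:].split()[0], seq


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Return up to k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records)
```

In use, something like `to_fasta(reservoir_sample(read_fastq(open("reads.fq")), 2000))` would produce FASTA text that can be saved and submitted to BLAST; for gzipped input, open the file with `gzip.open(path, "rt")` instead.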


    Also, you may want to read this thread:
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
    Last edited by Brian Bushnell; 05-30-2014, 09:09 AM.



    • #3
      Originally posted by Brian Bushnell:
      It's always a good idea to provide as much information as possible, for example, what organism this is. Some organisms (like fungi) often have at least one non-primary peak.

      But I encourage you to BLAST a few thousand reads against NR/NT/RefSeqMicrobial to see if you have contamination, which is a common cause of multiple peaks.
      Hi Brian,

      Thanks a lot for your suggestion. The organism I am working on is a plant with a genome size of around 1 Gb. Do you think microbial contamination could cause such an effect?



      • #4
        It's certainly possible. Plants are also commonly (or, in the wild, always) invaded by fungi. Furthermore, organelles like chloroplasts and mitochondria can have a substantially different GC content than the nuclear genome. The best way to figure it out, in my opinion, is to use BLAST.



        • #5
          Hi Kaidy,

          Have you tried aligning the data to your reference genome? On the whole I don't worry too much about weird peaks like this unless trying to explain a poor alignment rate.

          If the extra peak is created by contamination (very likely), then these sequences shouldn't align to your reference genome and will be discarded anyway. As Brian says, you may be able to identify where these come from using BLAST.

          If you're worried, you could always run FastQC again on just the reads that align. Picard tools also has a plot (from CollectGcBiasMetrics) which uses the reference genome to highlight any GC biases within aligned data.
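To compare the GC profile of all reads against only the aligned (or only the unaligned) reads outside FastQC, a per-read GC histogram can be computed directly. The following is a rough sketch, not FastQC's own implementation: it assumes that ambiguous bases such as N are ignored and that reads are grouped into 1% bins, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def gc_percent(seq: str) -> float:
    """GC percentage of one read, counting only called bases (A, C, G, T)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # read made only of N or other ambiguity codes
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(seqs: Iterable[str]) -> Dict[int, int]:
    """Count reads per integer GC bin (0..100), flooring each percentage."""
    counts = Counter(min(100, int(gc_percent(s))) for s in seqs)
    return {gc_bin: counts.get(gc_bin, 0) for gc_bin in range(101)}
```

Running `gc_histogram` once on all reads and once on the aligned subset, then comparing the two dictionaries bin by bin, shows whether the minor peak survives alignment.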

          Phil



          • #6
            similar odd GC content distribution

            Hi everyone!

            I am picking up on this thread again because I stumbled across a similar problem.

            I recently started analyzing, for the first time, some RNA-seq libraries made in our lab. After trimming (sickle) and mapping (tophat) I ran FastQC and saw an odd bimodal distribution in the per-sequence GC content plot (attached). Besides this, there is a 5' end bias that I understand is more or less expected (the not-so-random-primer problem) and that is reflected in the sequence and k-mer content, but it seems (to me) to be unrelated to GC content (I am attaching the full FastQC report).

            After seeing this I went back to the original fastq files and the bimodal GC distribution is similar before mapping. In addition the oddity seems not to be specific to this single library, as other libraries in the same experiment seem to have a similar behaviour.

            I have racked my brain over the odd GC content distribution for the last few days, but I found only a few similar cases across the web, and none of them gave me a good idea of what might be going on.

            Data info: PE Illumina sequencing, 100bp, Human post-mortem brain tissue

            Did anyone come across something like this before? Can you suggest any approach to figure out what that second peak is?

            Thank you!
            Marghi
            Last edited by marghi; 03-20-2015, 09:18 AM.



            • #7
              In the past this sort of thing has been seen when you have two (or more) organisms in the sample (see http://seqanswers.com/forums/showthread.php?t=48190).

              Since you have human brain tissue, hopefully this should not apply. Have you looked to see if there is a significant number of reads that do not map to the human genome?

              A couple of other threads on this topic:

              Bridged amplification & clustering followed by sequencing by synthesis. (Genome Analyzer / HiSeq / MiSeq)

              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc



              • #8
                Hi GenoMax,

                Thank you very much for your prompt suggestions!

                The second link in particular shows a distribution similar to mine, even though in a different system. I will try to blast the most represented sequences, as suggested, although my mapping rate to human is good (over 85% from tophat logs, if my interpretation is correct).

                I was wondering: since the second peak shows up both before and after mapping, it's unlikely to represent reads from another species, right? (Unless I misunderstood something fundamental along the way, or 15% of the reads is enough to produce a peak...)
                I also ran FastQC on the unmapped reads, following suggestions in a similar thread, and they do have their own second GC content peak at around 83-84% (attached).
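On the question of whether 15% of the reads is enough to produce a peak, a small numerical sketch can help. It models the GC plot as a mixture of two normal curves: 85% of reads in a broad main component and 15% in a narrow component near 83% GC. Only the 15% fraction and the roughly 83% position come from the post; the main-peak position (45%) and both standard deviations (8 and 3) are illustrative assumptions, not measurements from these libraries.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean GC %, standard deviation)


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(x: float, components: Sequence[Component]) -> float:
    """Weighted sum of the component densities at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def local_maxima(components: Sequence[Component]) -> List[int]:
    """Integer GC values (1..99) where the mixture is higher than both neighbours."""
    dens = [mixture_density(x, components) for x in range(101)]
    return [x for x in range(1, 100) if dens[x] > dens[x - 1] and dens[x] > dens[x + 1]]


# Hypothetical parameters: broad main peak, narrow minor peak.
components = [(0.85, 45.0, 8.0), (0.15, 83.0, 3.0)]
peaks = local_maxima(components)
ratio = mixture_density(83, components) / mixture_density(45, components)
print(peaks)  # → [45, 83]
```

Under these assumptions the minor component forms a clearly separate mode about 0.47 times as tall as the main one, because a narrow distribution concentrates its reads into few bins. So a 15% fraction can be visible; the sketch says nothing about what those reads actually are.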

                One of the posts in that thread asks why one should assume that a bimodal distribution is wrong, but, err, I guess that many people have looked at many human RNA-seq libraries all over the world by now, and if a bimodal GC distribution is not typically seen then there must be something odd about the libraries I am looking at, no?

                Marghi
                Last edited by marghi; 03-20-2015, 10:10 AM. Reason: add attachment



                • #9
                  If everything maps to human, the second peak may be some feature with a different GC content, like an organelle (mitochondria) or ribosomal RNA, or some super-highly-expressed gene with an odd GC content. Either way, I would consider it probably real, not an artifact. You could try splitting the reads by GC content and seeing where the odd ones map:

                  reformat.sh in=reads.fq out=high.fq mingc=0.8

                  Then map to the human transcriptome:

                  bbmap.sh ref=transcriptome.fa in=high.fq covstats=covstats.txt nzo

                  That will give you the coverage of each entry in the transcriptome.
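Once the coverage file exists, the entries that attract most of the high-GC reads can be listed by sorting on a coverage column. The sketch below is a hedged illustration: it assumes the file is tab-delimited with a single header line whose first field starts with `#`, and that it contains a column named `Avg_fold`; check the header of your own covstats.txt and pass a different `column` if it differs. The function name `top_covered` and the demo rows are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def top_covered(lines: Iterable[str], column: str = "Avg_fold", n: int = 10) -> List[Dict[str, str]]:
    """Return the n rows with the largest numeric value in `column`."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # assumed header style: "#ID<TAB>..."
    rows = [dict(zip(header, row)) for row in reader if row]
    rows.sort(key=lambda r: float(r[column]), reverse=True)
    return rows[:n]


# Made-up table for demonstration only; real files have more columns.
demo = "#ID\tAvg_fold\tLength\ntxA\t1.5\t1000\ntxB\t250.0\t800\ntxC\t12.0\t1200\n"
for row in top_covered(io.StringIO(demo), n=2):
    print(row["ID"], row["Avg_fold"])
```

With a real file, `top_covered(open("covstats.txt"))` would return the ten highest-coverage transcriptome entries, which are the first candidates to look up when explaining the second peak.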



                  • #10
                    Dear Brian,

                    Thank you very much for your suggestion as well. I am following this up and I will make sure to post what I (hopefully) find, in case this shows up again for somebody else in the future. It just takes a while, because the libraries are huge.

                    In the meantime I really want to make sure this is not some sort of technical problem: what concerns me is that if this were "real", I would expect it to have popped up before. I am on the hunt for similar data to use as a basis for comparison.

                    Best regards

