  • Tommyliu
    Junior Member
    • Apr 2013
    • 1

    Two peaks on FastQC plot "Per sequence GC content"

    Hi,
    I just got Illumina DNA genome re-sequencing data. Every module in the FastQC report passed except "Per sequence GC content". There are two peaks on the "Per sequence GC content" plot. The major peak centers around 40% GC content, while the minor peak centers around 70% GC content.

    I would appreciate it if you can explain to me how this happened and what I should do to correct it or discard the minor peak.

    Thanks in advance!
  • mastal
    Senior Member
    • Mar 2009
    • 666

    #2
    It suggests that maybe you have some kind of contamination.

    What %GC content are you expecting for the species you are sequencing?

    I would do adapter trimming/quality trimming and rerun FastQC afterwards to see whether that gets rid of the problem or not.


    • Wallysb01
      Senior Member
      • Feb 2011
      • 286

      #3
      It's probably the adapters. Do some trimming and it will go away.


      • simonandrews
        Simon Andrews
        • May 2009
        • 870

        #4
        If the secondary peak is very sharp it's probably a specific contaminant - often something which is found by the overrepresented sequences module.

        If the peak is fairly sharp and not too far from your main distribution, it could be extensive read-through into adapters, as suggested above.

        If the secondary peak is quite broad then it might be that you have contamination with a different species. You could use something like fastq_screen to check for other species you work with regularly, but this won't pick up other odd sources of contamination.
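A minimal Python sketch of the per-read GC calculation behind this kind of plot, assuming uncompressed four-line FASTQ records. The function names and the GC window are illustrative assumptions, and this is not how FastQC itself is implemented. Pulling out the reads that fall under the secondary peak makes it possible to inspect them (for example by BLAST) and judge whether they point to one specific contaminant or to another species.

```python
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def gc_percent(seq):
    """GC content of one read as a percentage; N and other symbols count as non-GC."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)


def gc_histogram(seqs):
    """Number of reads per rounded GC percentage."""
    return Counter(round(gc_percent(s)) for s in seqs)


def reads_in_gc_window(seqs, low, high):
    """Reads whose GC percentage lies in [low, high], e.g. under a secondary peak."""
    return [s for s in seqs if low <= gc_percent(s) <= high]
```

With a real file this could be used as `seqs = list(read_fastq_sequences(open("reads.fq")))` followed by `reads_in_gc_window(seqs, 65, 75)` for a minor peak near 70%; the file name and the window are placeholders.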


        • MichalGordon
          Junior Member
          • Jul 2012
          • 3

          #5
          Shouldn't the “Per base sequence content” and “Per base GC content” graphs show adapter contamination?


          • simonandrews
            Simon Andrews
            • May 2009
            • 870

            #6
            Originally posted by MichalGordon
            Shouldn't the “Per base sequence content” and “Per base GC content” graphs show adapter contamination?
            They might show some effects. If you have adapter dimers then you'll see the adapter sequence superimposed on the sequence content graphs. If your adapters have markedly different GC content than your library in general then you might also see an overall effect on the GC level.

            The latest FastQC release has a graph that specifically measures adapter content. It shows exactly what proportion of the library is composed of read-through adapter, which illustrates this much better than trying to use the sequence content plots.
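As a rough illustration of what such an adapter-content measurement reports, the sketch below counts the fraction of reads that contain a short adapter prefix. It is an assumption-laden simplification, not FastQC's implementation: the helper name is hypothetical, and the two sequences are the short prefixes commonly used to detect Illumina universal (TruSeq-style) and Nextera adapters, which should be checked against the documentation for the kit actually used.

```python
# Short adapter prefixes commonly used for detection (assumed here; verify for your kit).
ADAPTERS = {
    "illumina_universal": "AGATCGGAAGAGC",
    "nextera_transposase": "CTGTCTCTTATA",
}


def adapter_fraction(seqs, adapter_prefix):
    """Fraction of reads containing the adapter prefix anywhere (exact match only)."""
    total = 0
    hits = 0
    for seq in seqs:
        total += 1
        if adapter_prefix in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

Exact matching misses adapters with sequencing errors and adapters truncated at the read end, so a dedicated trimmer will usually find more than this count suggests.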


            • MichalGordon
              Junior Member
              • Jul 2012
              • 3

              #7
              Thank you!


              • chariko
                Member
                • Jun 2010
                • 56

                #8
                Originally posted by simonandrews
                They might show some effects. If you have adapter dimers then you'll see the adapter sequence superimposed on the sequence content graphs. If your adapters have markedly different GC content than your library in general then you might also see an overall effect on the GC level.

                The latest FastQC release has a graph that specifically measures adapter content. It shows exactly what proportion of the library is composed of read-through adapter, which illustrates this much better than trying to use the sequence content plots.

                I am having a similar problem with my run (2x150). As you can see, there are two peaks in my run. I expect a GC content of about 40% (bacterial genome), but I don't know why I obtained these two peaks.

                [PASS] Basic Statistics
                [PASS] Per base sequence quality
                [PASS] Per sequence quality scores
                [FAIL] Per base sequence content
                [FAIL] Per base GC content
                [WARNING] Per sequence GC content
                [PASS] Per base N content
                [WARNING] Sequence Length Distribution
                [WARNING] Sequence Duplication Levels
                [WARNING] Overrepresented sequences
                [WARNING] Kmer Content

                Oversequencing is probably not the problem, because I actually obtained fewer reads than expected. Could it be due to an adaptor problem? Any clue would be really appreciated.


                • nucacidhunter
                  Jafar Jabbari
                  • Jan 2013
                  • 1250

                  #9
                  I think it will be helpful if you could provide more information such as library type, input material, kit used for library prep and graphs from new version of FastQC.


                  • chariko
                    Member
                    • Jun 2010
                    • 56

                    #10
                    Originally posted by nucacidhunter
                    I think it will be helpful if you could provide more information such as library type, input material, kit used for library prep and graphs from new version of FastQC.

                    I updated FastQC to version 0.11.2 and my error disappeared. I wonder whether it was a problem with the old version...


                    • simonandrews
                      Simon Andrews
                      • May 2009
                      • 870

                      #11
                      Originally posted by chariko
                      I updated FastQC to version 0.11.2 and my error disappeared. I wonder whether it was a problem with the old version...

                      The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.


                      • chariko
                        Member
                        • Jun 2010
                        • 56

                        #12
                        Originally posted by simonandrews
                        The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.

                        As you can see in the per base composition plot, the C content goes down at position 5 (as seen in the per base GC plot before) and goes up at position 9. As the manual says, I assume the bias in the first 12 positions could be a selection bias.
                        I assume everything is OK then, since the GC content of the species is around 40%.

                        It was a Nextera MiSeq bacterial genome sequencing experiment.

                        Thank you very much for your help
                        Last edited by chariko; 08-19-2014, 01:40 AM.


                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          Everything is ok


                          • Khillo81
                            Junior Member
                            • May 2014
                            • 4

                            #14
                            Hi!

                            I have two problems: one is two peaks in the per sequence GC-content and another is a weird profile which I'm attaching here.

                            We're trying out Agilent's SureSelect enrichment protocol for Exome-Seq and have just concluded our first run on samples that were already done before using Illumina's Nextera kit (so we have another run with which to compare our results). The first run was sequenced on the Illumina HiSeq while this run was done on a MiSeq. Also, the first run was a 100bp paired end run while this was 150bp paired end run. Anyway, upon running a QC on the Fastq files I got this weird profile for the per-sequence GC content. I had already removed the low-quality reads and trimmed the adaptors but that didn't change anything. The only thing that helped was trimming 25 nucleotides from each end of the reads. Since we lose a lot of information that way, I'd prefer not to do this and want to ask if anyone has seen anything like this. I have no idea what might cause this.
                            Last edited by Khillo81; 10-14-2014, 04:55 AM.


                            • Brian Bushnell
                              Super Moderator
                              • Jan 2014
                              • 2709

                              #15
                              This is sometimes a sign of contamination, though if trimming the reads reduces it, that's a bit odd. Is this supposed to be human data? Human should peak around 50%, which does not correspond to either of your peaks. The most important question is what organism this is supposed to be, and what its average GC% is.

                              Also, please post an insert-size histogram, which will help determine if the problem is caused by short inserts. You can get one quickly using BBMerge:

                              bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt
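Once the histogram file exists, a short Python sketch like the one below could summarize how many merged pairs have inserts shorter than the read length, which is the situation that produces adapter read-through. It assumes the file is a two-column table of insert size and count, with header lines starting with `#`; that layout and the function names are assumptions to check against the actual `ihist.txt`.

```python
def parse_insert_histogram(lines):
    """Parse 'size<whitespace>count' rows, skipping blank lines and '#' header lines."""
    hist = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, count = line.split()[:2]
        hist[int(size)] = int(count)
    return hist


def fraction_shorter_than(hist, read_length):
    """Fraction of counted pairs whose insert size is below read_length."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    short = sum(count for size, count in hist.items() if size < read_length)
    return short / total
```

For a 2x150 run, `fraction_shorter_than(parse_insert_histogram(open("ihist.txt")), 150)` gives a quick estimate of how much of the library is affected by short inserts.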
