SEQanswers

Old 01-22-2014, 04:20 AM   #1
Tommyliu
Junior Member
 
Location: zurich

Join Date: Apr 2013
Posts: 1
Two peaks on FastQC plot "Per sequence GC content"

Hi,
I just got Illumina genomic DNA re-sequencing data. Every item in the FastQC report passed except "Per sequence GC content". There are two peaks on the "Per sequence GC content" plot: the major peak is centered around 40% GC, and the minor peak around 70% GC.

I would appreciate it if you could explain how this happened and what I should do to correct it or discard the minor peak.

Thanks in advance!
Old 01-22-2014, 07:00 AM   #2
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

That suggests you may have some kind of contamination.

What %GC content are you expecting for the species you are sequencing?

I would do adapter trimming/quality trimming and rerun FastQC afterwards to see whether that gets rid of the problem or not.
Old 01-22-2014, 08:01 AM   #3
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

It's probably the adapters. Do some trimming and it will go away.
Old 01-22-2014, 11:58 PM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

If the secondary peak is very sharp it's probably a specific contaminant - often something which is found by the overrepresented sequences module.

If the peak is fairly sharp and not too far from your main distribution, it could be long read-through into adapters, as suggested above.

If the secondary peak is quite broad then it might be that you have contamination with a different species. You could use something like fastq_screen to check for other species you work with regularly, but this won't pick up other odd sources of contamination.
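
To judge whether a secondary peak is sharp or broad, it can help to compute the per-read GC distribution directly. The following is a minimal Python sketch, not FastQC's own algorithm: the function names, the 5-bin window and the 2% cut-off are assumptions chosen for illustration, and the input is expected to be plain 4-line FASTQ records.

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def gc_percent(seq):
    """GC percentage of one read; N calls are ignored. None if no A/C/G/T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return None
    return 100.0 * gc / (gc + at)


def gc_histogram(seqs):
    """Return a list of 101 read counts, one per rounded GC percent."""
    counts = Counter()
    for seq in seqs:
        pct = gc_percent(seq)
        if pct is not None:
            counts[round(pct)] += 1
    return [counts.get(i, 0) for i in range(101)]


def find_peaks(hist, window=5, min_fraction=0.02):
    """Return GC bins that dominate their neighbourhood.

    A bin is a peak if it holds at least min_fraction of all reads, is
    strictly higher than every bin up to `window` to its left, and is at
    least as high as every bin up to `window` to its right.
    """
    total = sum(hist)
    peaks = []
    for i, n in enumerate(hist):
        if n == 0 or n < min_fraction * total:
            continue
        left = hist[max(0, i - window):i]
        right = hist[i + 1:i + 1 + window]
        if all(n > x for x in left) and all(n >= x for x in right):
            peaks.append(i)
    return peaks
```

Usage would be something like `find_peaks(gc_histogram(read_sequences(open("reads.fastq"))))`, where reads.fastq is a placeholder name. Real histograms are noisier than this simple neighbourhood test assumes, so treat the reported bins as a starting point for pulling out reads near the minor peak and inspecting them.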
Old 06-16-2014, 02:03 AM   #5
MichalGordon
Junior Member
 
Location: Israel

Join Date: Jul 2012
Posts: 3

Shouldn't the “Per base sequence content” and “Per base GC content” graphs show adapter contamination?
Old 06-16-2014, 02:09 AM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

Quote:
Originally Posted by MichalGordon
Shouldn't the “Per base sequence content” and “Per base GC content” graphs show adapter contamination?
They might show some effects. If you have adapter dimers then you'll see the adapter sequence superimposed on the sequence content graphs. If your adapters have markedly different GC content than your library in general then you might also see an overall effect on the GC level.

In the latest FastQC release there is a graph specifically for adapter content. It shows what proportion of the library is made up of read-through adapter, which illustrates this much better than trying to use the sequence content plots.
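
As a rough illustration of what such an adapter content graph measures, here is a minimal Python sketch. It is not FastQC's implementation: the function name is hypothetical, and the two adapter prefixes are assumed to be the short sequences commonly screened for, so they should be checked against the kit actually used.

```python
# Assumed adapter prefixes; verify against your library prep kit.
ADAPTERS = {
    "Illumina Universal Adapter": "AGATCGGAAGAGC",
    "Nextera Transposase Sequence": "CTGTCTCTTATA",
}


def adapter_content(seqs, adapter, read_length):
    """Cumulative % of reads in which `adapter` starts at or before each position.

    Only the first exact occurrence in each read is counted, so the curve
    can only rise towards the 3' end of the read.
    """
    first_hits = [0] * read_length
    total = 0
    for seq in seqs:
        total += 1
        pos = seq.upper().find(adapter)
        if 0 <= pos < read_length:
            first_hits[pos] += 1
    cumulative = []
    running = 0
    for count in first_hits:
        running += count
        cumulative.append(100.0 * running / total if total else 0.0)
    return cumulative
```

A curve that stays near zero suggests read-through is rare, while a curve that climbs along the read suggests many inserts are shorter than the read length. Exact matching is a simplification, since sequencing errors inside the adapter will be missed.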
Old 06-16-2014, 02:40 AM   #7
MichalGordon
Junior Member
 
Location: Israel

Join Date: Jul 2012
Posts: 3

Thank you!
Old 08-18-2014, 06:21 AM   #8
chariko
Member
 
Location: Spain

Join Date: Jun 2010
Posts: 56

Quote:
Originally Posted by simonandrews
They might show some effects. If you have adapter dimers then you'll see the adapter sequence superimposed on the sequence content graphs. If your adapters have markedly different GC content than your library in general then you might also see an overall effect on the GC level.

In the latest FastQC release there is a graph specifically for adapter content. It shows what proportion of the library is made up of read-through adapter, which illustrates this much better than trying to use the sequence content plots.
I am having a similar problem with my run (2x150). As you can see, there are two peaks in my run. I expect a GC content of about 40% (bacterial genome), but I don't know why I obtained these two peaks.

[PASS] Basic Statistics
[PASS] Per base sequence quality
[PASS] Per sequence quality scores
[FAIL] Per base sequence content
[FAIL] Per base GC content
[WARNING] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[WARNING] Sequence Duplication Levels
[WARNING] Overrepresented sequences
[WARNING] Kmer Content

Oversequencing is probably not the problem, because I actually obtained fewer reads than expected. Could it be due to an adaptor problem? Any clue would be really appreciated.
Attached image: per_base_gc_content.png
Old 08-18-2014, 03:55 PM   #9
nucacidhunter
Senior Member
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,166

It would be helpful if you could provide more information, such as library type, input material, the kit used for library prep, and graphs from the new version of FastQC.
Old 08-19-2014, 12:29 AM   #10
chariko
Member
 
Location: Spain

Join Date: Jun 2010
Posts: 56

Quote:
Originally Posted by nucacidhunter
It would be helpful if you could provide more information, such as library type, input material, the kit used for library prep, and graphs from the new version of FastQC.
I updated FastQC to version 0.11.2 and my error disappeared. I wonder if it was a problem with the old version...
Old 08-19-2014, 12:34 AM   #11
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

Quote:
Originally Posted by chariko
I updated FastQC to version 0.11.2 and my error disappeared. I wonder if it was a problem with the old version...
The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.
Old 08-19-2014, 01:37 AM   #12
chariko
Member
 
Location: Spain

Join Date: Jun 2010
Posts: 56

Quote:
Originally Posted by simonandrews
The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.
As you can see in the per base composition plot, the C content drops at position 5 (as seen before in the per base GC plot) and rises at position 9. As the manual says, I assume the first 12 positions could reflect a selection bias.
I assume everything is OK then, since the GC content of the species is around 40%.
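
The per-position check discussed here can be reproduced outside FastQC. Below is a minimal Python sketch, assuming the warning rule given in the FastQC help (a difference of more than 10% between A and T, or between G and C, at any position). The function names are hypothetical, and positions are reported 1-based as in the plot.

```python
def base_composition(seqs, max_len):
    """Return, per read position, the percentage of A, C, G and T."""
    counts = [dict.fromkeys("ACGT", 0) for _ in range(max_len)]
    for seq in seqs:
        for i, base in enumerate(seq.upper()[:max_len]):
            if base in counts[i]:  # skip N and other symbols
                counts[i][base] += 1
    composition = []
    for pos_counts in counts:
        total = sum(pos_counts.values())
        composition.append(
            {b: (100.0 * n / total if total else 0.0)
             for b, n in pos_counts.items()}
        )
    return composition


def unbalanced_positions(composition, threshold=10.0):
    """Return 1-based positions where |A-T| or |G-C| exceeds threshold."""
    flagged = []
    for i, pct in enumerate(composition, start=1):
        if (abs(pct["A"] - pct["T"]) > threshold
                or abs(pct["G"] - pct["C"]) > threshold):
            flagged.append(i)
    return flagged
```

If the flagged positions all fall within the first dozen bases, that is at least consistent with the priming or fragmentation bias described in the manual rather than with a problem in the rest of the read.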


It was a Nextera MiSeq bacterial genome sequencing experiment.

Thank you very much for your help
Attached image: per_base_sequence _content.png

Last edited by chariko; 08-19-2014 at 01:40 AM.
Old 08-19-2014, 03:10 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766

Everything is ok
Old 10-14-2014, 04:51 AM   #14
Khillo81
Junior Member
 
Location: Frankfurt, Germany

Join Date: May 2014
Posts: 4

Hi!

I have two problems: one is two peaks in the per sequence GC-content and another is a weird profile which I'm attaching here.

We're trying out Agilent's SureSelect enrichment protocol for Exome-Seq and have just concluded our first run on samples that were already run before using Illumina's Nextera kit (so we have another run with which to compare our results). The first run was sequenced on an Illumina HiSeq, while this run was done on a MiSeq. Also, the first run was a 100bp paired-end run, while this was a 150bp paired-end run. Upon running QC on the FASTQ files, I got this weird profile for the per-sequence GC content. I had already removed the low-quality reads and trimmed the adaptors, but that didn't change anything. The only thing that helped was trimming 25 nucleotides from each end of the reads. Since we lose a lot of information that way, I'd prefer not to do this, and I want to ask if anyone has seen anything like this. I have no idea what might cause it.
Attached image: GC-content.png

Last edited by Khillo81; 10-14-2014 at 04:55 AM.
Old 10-14-2014, 08:17 AM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

This is sometimes a sign of contamination, though if trimming the reads reduces it, that's a bit odd. Is this supposed to be human data? Human should peak around 50%, which does not correspond to either of your peaks. The most important question is what organism this is supposed to be, and what its average GC% is.

Also, please post an insert-size histogram, which will help determine if the problem is caused by short inserts. You can get one quickly using BBMerge:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt
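
Once ihist.txt exists, the fraction of short inserts can be read off with a few lines of Python. This sketch assumes the file consists of '#'-prefixed header lines followed by whitespace-separated insert size and count columns; the function name is hypothetical. Note that the histogram only covers pairs that could be merged.

```python
def short_insert_fraction(ihist_lines, read_length):
    """Fraction of merged pairs whose insert size is below read_length."""
    short = 0
    total = 0
    for line in ihist_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip summary and column-header lines
        size, count = line.split()[:2]
        size, count = int(size), int(count)
        total += count
        if size < read_length:
            short += count
    return short / total if total else 0.0
```

For a 2x150 run this would be called as `short_insert_fraction(open("ihist.txt"), 150)`. A large fraction means many reads run through the insert into adapter sequence.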
Old 10-14-2014, 11:01 AM   #16
nucacidhunter
Senior Member
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,166

Would you be able to post all of the FastQC output plots for comparison with other runs? For now, I would mention that exome capture does not sample the genome randomly, so it is not unusual to see what you are reporting.
Old 10-15-2014, 06:27 AM   #17
Khillo81
Junior Member
 
Location: Frankfurt, Germany

Join Date: May 2014
Posts: 4

Thanks for your response. I should first mention that I don't have a very strong background in bioinformatics and am using the CLC Genomics Workbench (ver. 7.5), which has a GUI and runs on Windows. I used the Workbench's 'Merge Overlapping Pairs' function to generate the histogram below (I'm guessing it's similar to the BBMerge mentioned by Brian). I also haven't used FastQC, but rather the native QC check in the Workbench; I'm attaching the output here. As you can see, there is no severe drop in quality along the reads, and apart from the peaks in GC content observed at the end of the read (as I understand it, typical for Illumina data), the GC content along the read length is around 45%. The samples are human.
Attached image: Merged pairs length distribution.png
Attached file: HT1159_22212-PR1_S2_L001_R1_001 (paired) - graphical QC report.pdf
Old 10-15-2014, 08:17 AM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Unfortunately, it looks like that tool does not merge reads with insert size shorter than read length, which was the point of the exercise. But from the graph I can infer that maybe 30% of the reads are indeed in that category, so there are a few possibilities:

1) The twin peaks are indeed from exon-capture bias, though I kind of doubt that, as it does not explain why trimming the reads would reduce it; and I would have expected such a bias to shift the peak center rather than creating a bimodal distribution, but of course it depends on the bait design.
2) There is an exonic and intronic peak, or gene and non-gene peak. The GC content of a gene changes markedly once you get just outside of its bounds. For example, just upstream of the gene, it becomes very AT-rich, IIRC. But, I don't really like that explanation either.
3) The adapter-trimming is unsuccessful or incomplete. From your GC content by base position, it looks fairly flat across the read, aside from the first 20 bp... so that doesn't make much sense either. Still, it wouldn't hurt to confirm. What was the total percentage of reads and bases trimmed during adapter-trimming? I would expect something like 30% of the reads and maybe 5-10% of the bases. If you are using Nextera adapters, be sure you use those sequences for trimming.


I suggest that you bin some of your reads by GC - just split them into pairs with GC<50% and GC>50%. Map both to human and look at the mapping rates (ideally, forcing unclipped global alignments). If they are equivalent, then the issue is not caused by contamination or adapter sequence, and it's probably safe to ignore.

You can split the reads by GC content with my reformat tool:

reformat.sh in1=read1.fq in2=read2.fq out1=low1.fq out2=low2.fq maxgc=0.5

reformat.sh in1=read1.fq in2=read2.fq out1=high1.fq out2=high2.fq mingc=0.5
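
After mapping the two bins, the comparison comes down to the mapped fraction in each. Here is a minimal Python sketch, assuming the alignments are available as uncompressed SAM text; the function name is hypothetical, and it relies only on the standard SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary).

```python
def mapping_rate(sam_lines):
    """Fraction of primary records in a SAM stream that are mapped."""
    mapped = 0
    total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # count each read once: ignore secondary/supplementary
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0
```

Running it on the SAM output of each bin, for example `mapping_rate(open("low.sam"))` and `mapping_rate(open("high.sam"))` with placeholder file names, gives the two rates to compare. How close they must be to count as equivalent is a judgment call.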
Old 12-29-2017, 08:23 AM   #19
Dr khani
Junior Member
 
Location: Tehran

Join Date: Jan 2017
Posts: 3

My FastQC GC content report has two peaks. Can anyone help me with how I can assemble this type of data?
Attached image: index.png
Old 12-29-2017, 09:06 AM   #20
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 330

As mentioned above, the two peaks could very well be a sign of a mixed sample (contamination).
You could remove all the high GC content reads and see if this improves the assembly.
BBtools (BBduk?) has a GC content filter.
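
For illustration, a GC filter of this kind can be sketched in plain Python. This is not the BBTools implementation: the function names and the 0.55 cut-off are assumptions, the input is expected as 4-line FASTQ records, and for paired data both mates would need to be kept or dropped together.

```python
def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def gc_fraction(seq):
    """GC fraction of a read over its A/C/G/T bases (0.0 if none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def filter_by_gc(lines, max_gc=0.55):
    """Yield only the records whose read has a GC fraction <= max_gc."""
    for record in fastq_records(lines):
        if gc_fraction(record[1]) <= max_gc:
            yield record
```

A sensible cut-off would be taken from the valley between the two peaks in the GC plot rather than from the placeholder value used here.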