SEQanswers

Old 05-30-2014, 08:04 AM   #1
Kaidy
Junior Member
 
Location: UK

Join Date: Jan 2014
Posts: 4
[FastQC] Strange Per Sequence GC Content

Hi,

I have Illumina DNA genome re-sequencing data. All the items in the FastQC report are satisfactory except "Per sequence GC content". There is a minor peak close to the main peak (please see the attached figure).

All the adapter sequences and low quality reads have already been removed, so I don't think the extra peak is caused by these sequences.

I would appreciate any ideas about what causes the odd shape of the peak and what I should do to correct it.

Thanks in advance!
Attached Images
File Type: png per_sequence_gc_content.png (27.9 KB, 100 views)
Old 05-30-2014, 09:05 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

It's always a good idea to provide as much information as possible, for example, what organism this is. Some organisms (like fungi) often have at least one non-primary peak.

But I encourage you to BLAST a few thousand reads against NR/NT/RefSeqMicrobial to see if you have contamination, which is a common cause of multiple peaks.
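One way to pull "a few thousand reads" out of a large FASTQ file before BLASTing them is reservoir sampling, which picks a uniform random subset in a single pass without loading the whole file. The following is only a minimal sketch, assuming plain 4-line FASTQ records; the function names (`read_fastq`, `reservoir_sample`, `to_fasta`) are made up for illustration and are not part of any tool mentioned in this thread.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs, assuming strict 4-line FASTQ records."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None or not header.strip():
            return  # end of input (or trailing blank line)
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line, ignored
        next(it, None)  # quality line, ignored
        yield header.strip()[1:], seq  # drop the leading '@'


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text, which BLAST accepts as a query."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Tiny in-memory example; a real run would iterate over an open file instead.
demo_fastq = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "GGCC", "+", "IIII",
    "@r3", "TTAA", "+", "IIII",
]
demo_subset = reservoir_sample(read_fastq(demo_fastq), k=2)
demo_fasta = to_fasta(demo_subset)
```

With a real file, something like `reservoir_sample(read_fastq(open("reads.fq")), k=5000)` would give the subset; the file name and the value of `k` here are placeholders.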


Also, you may want to read this thread:
http://seqanswers.com/forums/showthread.php?t=43708

Last edited by Brian Bushnell; 05-30-2014 at 09:09 AM.
Old 05-30-2014, 09:10 AM   #3
Kaidy
Junior Member
 
Location: UK

Join Date: Jan 2014
Posts: 4

Quote:
Originally Posted by Brian Bushnell View Post
It's always a good idea to provide as much information as possible, for example, what organism this is. Some organisms (like fungi) often have at least one non-primary peak.

But I encourage you to BLAST a few thousand reads against NR/NT/RefSeqMicrobial to see if you have contamination, which is a common cause of multiple peaks.
Hi Brian,

Thanks a lot for your suggestion. The organism I am working on is a plant with a genome size of around 1 Gb. Do you think microbial contamination would cause such an effect?
Old 05-30-2014, 09:15 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

It's certainly possible. Plants are also commonly - or, in the wild, always - invaded by fungi. Furthermore, organelles like chloroplast and mitochondria can have substantially different GC than the main organism. The best way to figure it out, in my opinion, is to use BLAST.
Old 05-30-2014, 11:49 AM   #5
ewels
Phil Ewels
 
Location: SciLifeLab, Stockholm, Sweden

Join Date: Mar 2011
Posts: 32

Hi Kaidy,

Have you tried aligning the data to your reference genome? On the whole I don't worry too much about weird peaks like this unless trying to explain a poor alignment rate.

If the extra peak is created by contamination (very likely), then these sequences shouldn't align to your reference genome and will be discarded anyway. As Brian says, you may be able to identify where these come from using BLAST.

If you're worried, you could always run FastQC again on just the reads that align. Picard tools also has a plot which uses the reference genome to highlight any GC biases within aligned data.
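To compare the GC distribution of all reads against only the aligned reads without rerunning the full report, a per-read GC histogram can be tallied directly. This is a rough sketch rather than FastQC's own method: it rounds each read's GC to a whole percent and ignores N bases, both of which are assumptions, and the names `gc_percent` and `gc_histogram` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def gc_percent(seq: str) -> Optional[int]:
    """GC content of one read as a whole percent; None if no A/C/G/T bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")  # N and others are ignored
    if called == 0:
        return None
    gc = seq.count("G") + seq.count("C")
    return round(100 * gc / called)


def gc_histogram(seqs: Iterable[str]) -> Counter:
    """Count how many reads fall at each whole-percent GC value."""
    hist: Counter = Counter()
    for seq in seqs:
        pct = gc_percent(seq)
        if pct is not None:
            hist[pct] += 1
    return hist


# Toy input; in practice the sequences would come from the FASTQ or BAM file.
demo_hist = gc_histogram(["GGCC", "ATGC", "AAAA", "NNNN", "ATGC"])
```

Building one histogram from all reads and one from the aligned reads, then checking whether the minor peak survives in the aligned set, is one way to test the contamination idea.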

Phil
Old 03-20-2015, 09:14 AM   #6
marghi
Member
 
Location: Germany

Join Date: Mar 2015
Posts: 10
Similar odd GC content distribution

Hi everyone!

I am picking up on this thread again because I stumbled across a similar problem.

I recently started analyzing, for the first time, some RNA-seq libraries made in our lab. After trimming (sickle) and mapping (tophat) I ran FastQC and saw an odd bimodal distribution in the per-sequence GC content plot (attached). Besides this, there is a 5'-end bias that I understand is more or less expected (the not-so-random-primer problem); it is reflected in the sequence and k-mer content but seems (to me) to be unrelated to GC content (I am attaching the full FastQC report).

After seeing this I went back to the original fastq files and the bimodal GC distribution is similar before mapping. In addition the oddity seems not to be specific to this single library, as other libraries in the same experiment seem to have a similar behaviour.

I have racked my brain over the odd GC content distribution for the last few days, but I found only a few similar cases on the web, and none of them gave me a good idea of what might be going on.

Data info: PE Illumina sequencing, 100bp, Human post-mortem brain tissue

Did anyone come across something like this before? Can you suggest any approach to figure out what that second peak is?

Thank you!
Marghi
Attached Images
File Type: png per_sequence_gc_content.png (33.7 KB, 42 views)
Attached Files
File Type: gz accepted_hits_fastqc.tar.gz (474.5 KB, 1 views)

Last edited by marghi; 03-20-2015 at 09:18 AM.
Old 03-20-2015, 09:21 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076

In the past this sort of thing has been seen when you have two (or more) organisms in the sample http://seqanswers.com/forums/showthread.php?t=48190.

Since you have human brain tissue, hopefully this should not apply. Have you looked to see if there is a significant number of reads that do not map to the human genome?

A couple of other threads on this topic:

http://seqanswers.com/forums/showthread.php?t=40147
http://seqanswers.com/forums/showthread.php?t=43708
Old 03-20-2015, 10:09 AM   #8
marghi
Member
 
Location: Germany

Join Date: Mar 2015
Posts: 10

Hi GenoMax,

Thank you very much for your prompt suggestions!

The second link in particular shows a distribution similar to mine, even though in a different system. I will try to blast the most represented sequences, as suggested, although my mapping rate to human is good (over 85% from tophat logs, if my interpretation is correct).

I was wondering: since the second peak shows up both before and after mapping, it is unlikely to represent reads from another species, right? (Unless I misunderstood something fundamental along the way, or maybe 15% of the reads is enough to produce a peak...)
I also ran FastQC on the unmapped reads, following suggestions in a similar thread, and they do have their own second GC content peak at around 83-84% (attached).
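On the question of whether 15% of the reads could be enough to show up as a peak, a quick back-of-the-envelope model can help. The sketch below treats the GC distribution as a mixture of two bell curves and counts the local maxima. Every number in it (85%/15% weights, means of 45% and 83% GC, the spreads) is an illustrative assumption, not a measurement from these libraries.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean GC %, standard deviation)


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def mixture_density(x: float, components: Sequence[Component]) -> float:
    """Weighted sum of normal densities at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def local_maxima(values: Sequence[float]) -> List[int]:
    """Indices that are strictly higher than both neighbours."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


# Hypothetical mixture: 85% of reads centred on 45% GC, 15% centred on 83% GC.
components = [(0.85, 45.0, 6.0), (0.15, 83.0, 4.0)]
curve = [mixture_density(gc, components) for gc in range(101)]
peaks = local_maxima(curve)  # GC percentages where the curve peaks
```

Under these assumed numbers the curve has two separate maxima, so a minority population of that size can produce its own peak as long as its GC content sits far enough from the main one. It says nothing about what the minority population actually is.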

One of the posts in the thread asks why we assume that bimodal is wrong, but, err, I guess that many people all over the world have looked at many human RNA-seq libraries by now, and if a bimodal GC distribution is not typically seen, then there must be something odd about the libraries I am looking at, no?

Marghi
Attached Images
File Type: png per_sequence_gc_content.png (30.6 KB, 22 views)

Last edited by marghi; 03-20-2015 at 10:10 AM. Reason: add attachment
Old 03-20-2015, 12:20 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

If everything maps to human, the second peak may be some feature with a different GC, like an organelle (mitochondria) or a ribosome. Or some super-highly-expressed gene with an odd GC. Anyway, I would consider it probably real, not an artifact. You could try splitting the reads by GC content and seeing where the odd ones map:

reformat.sh in=reads.fq out=high.fq mingc=0.8

Then map to the human transcriptome:

bbmap.sh ref=transcriptome.fa in=high.fq covstats=covstats.txt nzo

That will give you the coverage of each entry in the transcriptome.
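For anyone who wants to see what the GC split amounts to, here is a small Python sketch of the same idea: keep reads whose GC fraction is at or above a threshold. It mirrors the intent of the `mingc=0.8` filter above, not the tool's actual implementation; how N bases are handled here (ignored) and the names `gc_fraction` and `split_by_gc` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def gc_fraction(seq: str) -> float:
    """Fraction of called bases (A/C/G/T) that are G or C; 0.0 if none."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def split_by_gc(
    records: Iterable[Record], min_gc: float = 0.8
) -> Tuple[List[Record], List[Record]]:
    """Split reads into (GC >= min_gc, GC < min_gc)."""
    high: List[Record] = []
    low: List[Record] = []
    for name, seq in records:
        target = high if gc_fraction(seq) >= min_gc else low
        target.append((name, seq))
    return high, low


# Toy reads; real input would be parsed from the FASTQ file.
demo_reads = [("a", "GGGGC"), ("b", "ATATA"), ("c", "GGGGA"), ("d", "GCGTA")]
demo_high, demo_low = split_by_gc(demo_reads, min_gc=0.8)
```

The high-GC set is what would then be mapped to the transcriptome to see which entries collect those reads.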
Old 03-23-2015, 01:46 PM   #10
marghi
Member
 
Location: Germany

Join Date: Mar 2015
Posts: 10

Dear Brian,

Thank you very much for your suggestion as well. I am following this up and I will make sure to post what I (hopefully) find, in case this shows up again for somebody else in the future. It just takes a while, because the libraries are huge.

In the meantime I really want to make sure this is not some sort of technical problem: what concerns me is that if this were "real", I would expect it to have popped up before. I am on the hunt for similar data to use for comparison.

Best regards