SEQanswers

Old 05-28-2014, 02:10 PM   #1
kazi1
Junior Member
 
Location: Vancouver

Join Date: May 2014
Posts: 4
[FASTQC] Biases in GC whole sequence content

Hey all, I am analyzing several RNA-Seq datasets and have noticed somewhat of an odd pattern. In all of the samples I've QC'd so far, there is a bias or "shoulder" towards reads with a low GC content. I am wondering what causes this, if it's a problem, and if so, what should I do about it.

I am working with 3 different RNA-Seq datasets from two different organisms: Drosophila melanogaster and Aedes aegypti. These datasets were produced by 3 different research groups (including my own) and 3 different sequencing companies, so I doubt it's an error in sample prep or sequencing. The sequencing was quite deep. The only common factor I can find is that all 3 groups used Illumina TruSeq kits for library construction. I know for a fact that this bias is not caused by the "random" hexamer priming issue or by low sequence quality: slicing the 5' end off and filtering out low-quality reads has no effect on the GC bias "shoulder".

Just curious what causes this phenomenon. I'm attaching one of the more extreme examples from before and after basic QC. My guess is that this bias won't really affect differential expression calling (since it's the same for all samples), but it's still weirding me out a bit.
Attached Images
File Type: png per_sequence_gc_content.png (30.2 KB, 66 views)
File Type: png per_sequence_gc_content_afterQC.png (30.5 KB, 57 views)
Old 05-28-2014, 02:53 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Have you BLASTed the libraries to look for contamination?

Also, could that shoulder simply be poly-A sequences? You might try trimming poly-A tails and then rerunning FastQC.
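
As a rough illustration of the poly-A suggestion above, here is a minimal Python sketch that strips a 3' poly-A run (and a 5' poly-T run, which is what a tail looks like on the opposite strand) from a read. The function name `trim_poly_a` and the `min_len=10` cutoff are assumptions made for this example, not settings from any particular tool; in practice a dedicated trimmer would do this step.

```python
import re


def trim_poly_a(seq, qual, min_len=10):
    """Trim a trailing run of A's and a leading run of T's.

    Hypothetical helper: only runs of at least `min_len` bases are
    removed, and the quality string is trimmed to match the sequence.
    """
    # 3' poly-A tail: a run of A's anchored at the end of the read.
    tail = re.search(r"A{%d,}$" % min_len, seq)
    if tail:
        seq, qual = seq[:tail.start()], qual[:tail.start()]

    # 5' poly-T run: a run of T's anchored at the start of the read.
    head = re.match(r"T{%d,}" % min_len, seq)
    if head:
        seq, qual = seq[head.end():], qual[head.end():]

    return seq, qual
```

If the shoulder shrinks after trimming like this and rerunning FastQC, poly-A reads were at least part of it; if it does not move, something else is going on.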
Old 05-28-2014, 04:22 PM   #3
kazi1
Junior Member
 
Location: Vancouver

Join Date: May 2014
Posts: 4

Although there was some TruSeq adapter contamination in this sample initially (2% of reads), dumping the low-quality reads and trimming off the 5' ends got rid of it. There weren't any overrepresented sequences or enriched k-mers whatsoever (like poly-A sequences) after QC. I'm not sure what else I would be BLASTing in the libraries.

However, after doing a bit of brainstorming, I might have a hypothesis for what the shoulder is. A lot of insect species (including D. melanogaster and A. aegypti) can be infected by a bacterium called Wolbachia (especially common in laboratory stocks). I checked the GC content for the A. aegypti transcriptome (which is the sample I posted here) and it's about 50%, which corresponds to the main peak of the graphs I posted. The GC content of the Wolbachia genome is ~35%, which would match the second peak/shoulder. If this is the case, I'd find a bunch of Wolbachia-specific genes when I assemble the transcriptome. I could potentially mask out the Wolbachia contamination later when I start performing expression counting.

(But I'm not quite to that point in my analysis yet, so I'll let you know what happens and post back here when I do. I'm going to be pretty amused if it turns out all 3 laboratories have a massive Wolbachia problem...)
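
The two-population idea above (a main peak near 50% GC plus a shoulder near 35%) can be checked directly on the reads without waiting for an assembly. The sketch below computes per-read GC% and bins it, which is roughly the quantity behind the "Per sequence GC content" plot. The helper names, the 5% bin width, and the plain 4-line FASTQ assumption are all choices made for this example.

```python
from collections import Counter
from itertools import islice


def gc_percent(seq):
    """GC percentage of one read, ignoring N and other non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_histogram(seqs, bin_width=5):
    """Count reads per GC% bin; keys are the lower edge of each bin."""
    counts = Counter()
    for seq in seqs:
        lower_edge = int(gc_percent(seq) // bin_width) * bin_width
        # Put reads with exactly 100% GC into the last bin.
        counts[min(lower_edge, 100 - bin_width)] += 1
    return dict(sorted(counts.items()))


def read_fastq_seqs(path):
    """Yield sequence lines, assuming uncompressed 4-line FASTQ records."""
    with open(path) as handle:
        for line in islice(handle, 1, None, 4):
            yield line.strip()


# Example (hypothetical file name):
# print(gc_histogram(read_fastq_seqs("sample_R1.fastq")))
```

Two clear humps in the binned counts would fit a mix of two sequence populations, but on its own it cannot say whether the second one is a contaminant or just a low-GC class of transcripts.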
Old 05-28-2014, 04:57 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Right, BLAST would potentially let you know if it was bacterial, and if so what species, so you can better filter it.
Old 05-29-2014, 06:42 AM   #5
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315

Hi Kazi1,
Seems like you are presuming by default that a transcriptome GC% profile should not be bimodal. Why would you presume that? I mean other than FastQC giving you a big red "X" next to "Per sequence GC content".

--
Phillip
Old 05-29-2014, 11:10 AM   #6
kazi1
Junior Member
 
Location: Vancouver

Join Date: May 2014
Posts: 4

It's true, I've made the assumption that it shouldn't be bimodal simply on the basis of the "big red X" in FastQC. I haven't done that much bioinformatics work before, so I've been working through and trying to figure out what each of the QC flags means. I've got 3 red flags from FastQC right now: "per base sequence content" (from the random hexamer priming), "sequence duplication" (from the high level of coverage), and the "per sequence GC content". The "per sequence GC content" is the only one I can't explain.

I know that FastQC is optimized for genomic DNA reads, so perhaps it's just sending up that flag unnecessarily when dealing with RNA-Seq data? It'd be great if that's just the way transcriptomic data looks normally. I just wanted some second opinions (from people with more experience with FastQC/RNA-Seq).
Old 05-29-2014, 11:52 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

The big red "X" in FastQC is not an immediate indication of that step failing completely. Since you expect to see coexistence of an unrelated species (Wolbachia), seeing a strange GC distribution would be acceptable for your data.
Old 05-30-2014, 06:45 AM   #8
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315

Even in genomic DNA libraries I occasionally see bimodal (or trimodal) distributions of GC% in that plot. Although contamination (or infestation) of the sample with another species is possible, I see no reason to presume it is the case.

Still no harm pulling out a few thousand representative reads from the two peaks and blasting them to see if you get lots of best hits to a different phylum or kingdom. But it could be a waste of time and might even lead you to throwing out data that actually should be kept.
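
One way to pull reads from each peak for BLAST, as described above, is to filter by per-read GC% and take a random subsample from each window. This is only a sketch: the function names are made up for this example, and the GC windows in the usage comment are guesses based on the ~35% and ~50% figures mentioned earlier in the thread, so they should be adjusted to where the peaks actually sit in your plot.

```python
import random


def gc_percent(seq):
    """GC percentage of one read, ignoring non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def sample_reads_by_gc(records, low, high, n, seed=0):
    """Return up to n random (name, seq) pairs with low <= GC% < high."""
    in_window = [(name, seq) for name, seq in records
                 if low <= gc_percent(seq) < high]
    if len(in_window) <= n:
        return in_window
    # Fixed seed so the same subsample can be regenerated later.
    return random.Random(seed).sample(in_window, n)


def to_fasta(records):
    """Format (name, seq) pairs as FASTA text for a BLAST query file."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)


# Example with assumed windows: shoulder at 30-40% GC, main peak at 45-55%.
# shoulder = sample_reads_by_gc(records, 30, 40, n=2000)
# main_peak = sample_reads_by_gc(records, 45, 55, n=2000)
```

Comparing the best-hit taxa between the two FASTA files is what tells you something; hits to a different phylum or kingdom in only one of them would point toward a second organism.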

The big red "X" issue is one that plagues us occasionally. You just need to take it in stride. It is just a program. You don't want to turn off your brain when using it.
--
Phillip
Old 06-02-2014, 03:02 PM   #9
kazi1
Junior Member
 
Location: Vancouver

Join Date: May 2014
Posts: 4

OK, good to keep in mind! Thanks to all for your advice!
Old 06-02-2014, 11:20 PM   #10
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45

Quote:
Originally Posted by kazi1 View Post
I'm going to be pretty amused if it turns out all 3 laboratories have a massive Wolbachia problem...)
A lot of insects have Wolbachia integrated into their chromosomes. Might not be contamination/infection.

Tags
bias, fastqc, gc content, shoulder

All times are GMT -8.