SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
[FastQC]Strange Per Sequence GC Content Kaidy Illumina/Solexa 9 03-23-2015 01:46 PM
FastQC,kmer content, per base sequence content: is this good enough mgg Bioinformatics 10 11-06-2013 10:45 PM
Per Base Sequence Content sindrle RNA Sequencing 2 08-24-2013 08:19 AM
Bizarre RNAse cleanup problem HereBeDragons Sample Prep / Library Generation 1 07-24-2013 04:18 AM
Contamination? What does this per sequence GC content mean? ZoeG Bioinformatics 6 06-18-2013 04:15 PM

Reply
 
Thread Tools
Old 08-01-2014, 11:56 AM   #1
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default Bizarre per sequence GC content distribution, what could cause this?

Greetings. I am working with an Illumina library of whole genome sequence for a very AT-rich genome. When I initially did a FASTQC I saw this:

The bimodal distribution is indicative of contamination. I mapped the reads to a reference and looking at the unmapped reads, i see that they are indeed the source of the smaller bump. However, when I look at the MAPPED reads, I get this even stranger distribution (both actual and theoretical):

Normally, after the mapping the distribution with this organism looks like this:

The FASTQC analyses doesn't reveal any overrepresented sequences, and the library has only a 9.11% duplication level. What could possibly be going on?
Genomics101 is offline   Reply With Quote
Old 08-01-2014, 12:25 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That contradicts your initial plot - for one thing, there are more total reads after mapping.

Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment? Try using only uniquely-mapped reads, or at least, only primary alignments.
Brian Bushnell is offline   Reply With Quote
Old 08-01-2014, 12:42 PM   #3
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

Hi, Brian, thanks for your response.

Quote:
That contradicts your initial plot - for one thing, there are more total reads after mapping.
This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.

Quote:
Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment?
This could easily be the case for the mapping, however I am not clear how this would artificially inflate the number of reads.

Quote:
Try using only uniquely-mapped reads
I would be keen to do this, how is this done?

Quote:
or at least, only primary alignments.
I'm not sure what this is.

Thanks again.
Genomics101 is offline   Reply With Quote
Old 08-01-2014, 12:51 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by Genomics101 View Post
This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.
Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent. The primary peak centered on 20% GC appeared to stay the same shape and same height.

You can use samtools on the sam or bam file from the mapped reads to filter only primary alignments, or change the parameters of the mapping program to only allow one alignment per read. With BBMap, for example, you would do so like this:

bbmap.sh ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq

...because its default behavior is to only output one read per input read. And this command will keep the output in fastq format.
Brian Bushnell is offline   Reply With Quote
Old 08-01-2014, 01:20 PM   #5
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

Quote:
Originally Posted by Brian Bushnell View Post
Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent.
This is true and and I puzzled by it as well. Although the number of reads went down after I extracted only the mapped reads (or so I thought), not enough. I'm going to make sure I am doing things properly AND re-run the analysis and see where it goes. Thanks for the sharp eyes.

EDIT: Looks like I at least did the analysis correctly, and I am now adding the additional analyses you suggested.

Last edited by Genomics101; 08-01-2014 at 02:33 PM.
Genomics101 is offline   Reply With Quote
Old 08-01-2014, 03:18 PM   #6
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

The mystery continues. I did all the analyses again with exactly the same results. Filtering for the primary alignments however gives me MORE reads in total than i have with just the original FASTQ (now it's up to 1970667 from the original 1804074).

For what it's worth, this is the GC content:


Where are these extra reads (or what ever is getting interpreted as such) coming from?

Last edited by Genomics101; 08-01-2014 at 03:20 PM.
Genomics101 is offline   Reply With Quote
Old 08-01-2014, 03:51 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

If you are aligning with BWA, it now generates "chimeric" alignments that you can't get rid of just by filtering for primary; you also have to exclude reads with flag "0x800". I suggest you look for a couple of the reads with names that appear more than twice in the sam file and if possible post them here (the entire lines).
Brian Bushnell is offline   Reply With Quote
Old 08-01-2014, 05:44 PM   #8
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

Hi, Brian. I'm doing that now. In the mean time, here is how it looks with reads that only have unique positions:

Although I didn't trim the reads I noticed that there are some strange, rogue small reads (20-50 bp) in there and I have no idea how that happened. I have a feeling those are the ones that are responsible for the strange distribution. Does anyone know how to remove them?

Last edited by Genomics101; 08-01-2014 at 07:06 PM.
Genomics101 is offline   Reply With Quote
Old 08-01-2014, 06:17 PM   #9
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

Yes I am using BWA (MEM).

Here is a read whose name shows up 8 times:

Code:
M01243:91:000000000-A6ELJ:1:2119:9911:4339      65      Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2129    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      129     Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2193    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;
And I'll be darned, look at those last short reads, all AT! What is going on here I wonder?
Genomics101 is offline   Reply With Quote
Old 08-01-2014, 07:02 PM   #10
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

Removed all reads less that 200bp and quality under 20 and viola:
Genomics101 is offline   Reply With Quote
Old 08-01-2014, 08:54 PM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Great, looks like the problem is solved!
Brian Bushnell is offline   Reply With Quote
Old 08-02-2014, 08:52 AM   #12
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know? Do you have any idea what is going on with those little AT-rich losers turning up with legit read names? It must have something to do with the library construction, because I have never had this trouble with AT-rich genomes (this organism or others) before. Thanks again!
Genomics101 is offline   Reply With Quote
Old 08-02-2014, 08:59 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975
Default

This looks like MiSeq data so those reads are likely examples of no insert or a very short insert. Then reads went into the adapter lawn on the flowcell.
GenoMax is offline   Reply With Quote
Old 08-02-2014, 10:21 AM   #14
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

Thanks very much, GenoMax. Indeed, it is MiSeq data, but I never had this problem with MiSeq before (that was with 250bp PE reads, these are 300s). Can you tell me more the particular pathology with MiSeq? Is this a problem with library construction? And, goodness, what is an adapter lawn?
Genomics101 is offline   Reply With Quote
Old 08-02-2014, 10:41 AM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by Genomics101 View Post
A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know?
The problem seemed to be caused by reads that magically appeared during mapping, and that normally means reads with multiple alignments. And short, extreme-GC reads are much more likely to map in multiple places because of their low information content.
Brian Bushnell is offline   Reply With Quote
Old 08-02-2014, 10:45 AM   #16
Genomics101
Member
 
Location: Maryland, USA

Join Date: May 2012
Posts: 60
Default

Thanks, Brian. This is where I am showing my ignorance I am sure, but how did the reads become so short? Looking at what I pulled out of the sam file, they are full-length (300bp) reads for the first few matches, but then become those little buggers are well.
Genomics101 is offline   Reply With Quote
Old 08-02-2014, 11:50 AM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BWA-mem produces 'chimeric alignments'. This is actually a really neat feature in some cases, and a big pain in other cases - in my opinion, it should be disabled by default.

If you look at the sam lines you posted, most of them have a bitflag (the second column) of over 2048. That indicates they are chimeric. BWA-mem appears to do multiple local alignments on reads, such that if there is a really good match for the first 20% somewhere, that will be presented as a single line in the sam file, and if there is a really good match for the middle 40%, that will be displayed as a different line, etc. So a single read could generate a huge number of lines in the sam file. The goal is to correctly map reads that are chimeric (such as reads from a cancer sample with two chromosomes randomly fused together). But apparently, it does not work well in extreme-GC genomes; most mappers are designed for human and mouse genomes, which have approximately 50% GC, as they constitute the majority of genetic research. But since I work at a place that strictly deals with microbial, plant, and fungal genomes, BBMap (which was originally designed for human) is now developed for and tested on a much wider array of organisms than most.

BWA's chimeric alignments are local and hard-clipped. For example, this cigar string from the second line you posted - "221H79M" - means that the first 221 bases were ignored and only the last 79 bases are included in the alignment. Of course, this will wreak havoc with something like fastqc, where all reads are weighted equally regardless of length. Rather than a length filter (which will unnecessarily exclude reads that had been adapter- or quality-trimmed), I think you should simply use samtools to filter out reads with the chimeric flag marked.

Last edited by Brian Bushnell; 08-02-2014 at 12:00 PM.
Brian Bushnell is offline   Reply With Quote
Old 08-02-2014, 12:20 PM   #18
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975
Default

Quote:
Originally Posted by Genomics101 View Post
Thanks very much, GenoMax. Indeed, it is MiSeq data, but I never had this problem with MiSeq before (that was with 250bp PE reads, these are 300s). Can you tell me more the particular pathology with MiSeq? Is this a problem with library construction? And, goodness, what is an adapter lawn?
Nucacidhunter has a nice description in this thread: http://seqanswers.com/forums/showthread.php?t=43071
GenoMax is offline   Reply With Quote
Reply

Tags
fastqc, gc-content

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:18 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO