Unconfigured Ad

**Brian Bushnell** · 08-01-2014, 12:25 PM

That contradicts your initial plot - for one thing, there are more total reads after mapping.

Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment? Try using only uniquely-mapped reads, or at least, only primary alignments.

**Genomics101** · 08-01-2014, 12:42 PM

Hi, Brian, thanks for your response.

That contradicts your initial plot - for one thing, there are more total reads after mapping.

This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.

Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment?

This could easily be the case for the mapping, however I am not clear how this would artificially inflate the number of reads.

Try using only uniquely-mapped reads

I would be keen to do this, how is this done?

or at least, only primary alignments.

I'm not sure what this is.

Thanks again.

**Brian Bushnell** · 08-01-2014, 12:51 PM

Originally posted by Genomics101 View Post

This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.

Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent. The primary peak centered on 20% GC appeared to stay the same shape and same height.

You can use samtools on the sam or bam file from the mapped reads to filter only primary alignments, or change the parameters of the mapping program to only allow one alignment per read. With BBMap, for example, you would do so like this:

bbmap.sh ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq

...because its default behavior is to only output one read per input read. And this command will keep the output in fastq format.

**Genomics101** · 08-01-2014, 01:20 PM

Originally posted by Brian Bushnell View Post

Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent.

This is true and and I puzzled by it as well. Although the number of reads went down after I extracted only the mapped reads (or so I thought), not enough. I'm going to make sure I am doing things properly AND re-run the analysis and see where it goes. Thanks for the sharp eyes.

EDIT: Looks like I at least did the analysis correctly, and I am now adding the additional analyses you suggested.

**Genomics101** · 08-01-2014, 03:18 PM

The mystery continues. I did all the analyses again with exactly the same results. Filtering for the primary alignments however gives me MORE reads in total than i have with just the original FASTQ (now it's up to 1970667 from the original 1804074).

For what it's worth, this is the GC content:

Where are these extra reads (or what ever is getting interpreted as such) coming from?

**Brian Bushnell** · 08-01-2014, 03:51 PM

If you are aligning with BWA, it now generates "chimeric" alignments that you can't get rid of just by filtering for primary; you also have to exclude reads with flag "0x800". I suggest you look for a couple of the reads with names that appear more than twice in the sam file and if possible post them here (the entire lines).

**Genomics101** · 08-01-2014, 05:44 PM

Hi, Brian. I'm doing that now. In the mean time, here is how it looks with reads that only have unique positions:

Although I didn't trim the reads I noticed that there are some strange, rogue small reads (20-50 bp) in there and I have no idea how that happened. I have a feeling those are the ones that are responsible for the strange distribution. Does anyone know how to remove them?

**Genomics101** · 08-01-2014, 06:17 PM

Yes I am using BWA (MEM).

Here is a read whose name shows up 8 times:

Code:

M01243:91:000000000-A6ELJ:1:2119:9911:4339      65      Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2129    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      129     Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
M01243:91:000000000-A6ELJ:1:2119:9911:4339      2193    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;

And I'll be darned, look at those last short reads, all AT! What is going on here I wonder?

**Genomics101** · 08-01-2014, 07:02 PM

Removed all reads less that 200bp and quality under 20 and viola:

**Brian Bushnell** · 08-01-2014, 08:54 PM

Great, looks like the problem is solved!

**Genomics101** · 08-02-2014, 08:52 AM

A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know? Do you have any idea what is going on with those little AT-rich losers turning up with legit read names? It must have something to do with the library construction, because I have never had this trouble with AT-rich genomes (this organism or others) before. Thanks again!

**GenoMax** · 08-02-2014, 08:59 AM

This looks like MiSeq data so those reads are likely examples of no insert or a very short insert. Then reads went into the adapter lawn on the flowcell.

**Genomics101** · 08-02-2014, 10:21 AM

Thanks very much, GenoMax. Indeed, it is MiSeq data, but I never had this problem with MiSeq before (that was with 250bp PE reads, these are 300s). Can you tell me more the particular pathology with MiSeq? Is this a problem with library construction? And, goodness, what is an adapter lawn?

**Brian Bushnell** · 08-02-2014, 10:41 AM

Originally posted by Genomics101 View Post

A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know?

The problem seemed to be caused by reads that magically appeared during mapping, and that normally means reads with multiple alignments. And short, extreme-GC reads are much more likely to map in multiple places because of their low information content.

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, Yesterday, 11:08 AM	0 responses 6 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 11 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 53 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Bizarre per sequence GC content distribution, what could cause this?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News