Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Genomics101
    Member
    • May 2012
    • 60

    Bizarre per sequence GC content distribution, what could cause this?

    Greetings. I am working with an Illumina library of whole genome sequence for a very AT-rich genome. When I initially did a FASTQC I saw this:

    The bimodal distribution is indicative of contamination. I mapped the reads to a reference and looking at the unmapped reads, i see that they are indeed the source of the smaller bump. However, when I look at the MAPPED reads, I get this even stranger distribution (both actual and theoretical):

    Normally, after the mapping the distribution with this organism looks like this:

    The FASTQC analyses doesn't reveal any overrepresented sequences, and the library has only a 9.11% duplication level. What could possibly be going on?
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    That contradicts your initial plot - for one thing, there are more total reads after mapping.

    Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment? Try using only uniquely-mapped reads, or at least, only primary alignments.

    Comment

    • Genomics101
      Member
      • May 2012
      • 60

      #3
      Hi, Brian, thanks for your response.

      That contradicts your initial plot - for one thing, there are more total reads after mapping.
      This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.

      Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment?
      This could easily be the case for the mapping, however I am not clear how this would artificially inflate the number of reads.

      Try using only uniquely-mapped reads
      I would be keen to do this, how is this done?

      or at least, only primary alignments.
      I'm not sure what this is.

      Thanks again.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Originally posted by Genomics101 View Post
        This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.
        Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent. The primary peak centered on 20% GC appeared to stay the same shape and same height.

        You can use samtools on the sam or bam file from the mapped reads to filter only primary alignments, or change the parameters of the mapping program to only allow one alignment per read. With BBMap, for example, you would do so like this:

        bbmap.sh ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq

        ...because its default behavior is to only output one read per input read. And this command will keep the output in fastq format.

        Comment

        • Genomics101
          Member
          • May 2012
          • 60

          #5
          Originally posted by Brian Bushnell View Post
          Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent.
          This is true and and I puzzled by it as well. Although the number of reads went down after I extracted only the mapped reads (or so I thought), not enough. I'm going to make sure I am doing things properly AND re-run the analysis and see where it goes. Thanks for the sharp eyes.

          EDIT: Looks like I at least did the analysis correctly, and I am now adding the additional analyses you suggested.
          Last edited by Genomics101; 08-01-2014, 02:33 PM.

          Comment

          • Genomics101
            Member
            • May 2012
            • 60

            #6
            The mystery continues. I did all the analyses again with exactly the same results. Filtering for the primary alignments however gives me MORE reads in total than i have with just the original FASTQ (now it's up to 1970667 from the original 1804074).

            For what it's worth, this is the GC content:


            Where are these extra reads (or what ever is getting interpreted as such) coming from?
            Last edited by Genomics101; 08-01-2014, 03:20 PM.

            Comment

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              If you are aligning with BWA, it now generates "chimeric" alignments that you can't get rid of just by filtering for primary; you also have to exclude reads with flag "0x800". I suggest you look for a couple of the reads with names that appear more than twice in the sam file and if possible post them here (the entire lines).

              Comment

              • Genomics101
                Member
                • May 2012
                • 60

                #8
                Hi, Brian. I'm doing that now. In the mean time, here is how it looks with reads that only have unique positions:

                Although I didn't trim the reads I noticed that there are some strange, rogue small reads (20-50 bp) in there and I have no idea how that happened. I have a feeling those are the ones that are responsible for the strange distribution. Does anyone know how to remove them?
                Last edited by Genomics101; 08-01-2014, 07:06 PM.

                Comment

                • Genomics101
                  Member
                  • May 2012
                  • 60

                  #9
                  Yes I am using BWA (MEM).

                  Here is a read whose name shows up 8 times:

                  Code:
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      65      Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2129    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      129     Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2193    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;
                  And I'll be darned, look at those last short reads, all AT! What is going on here I wonder?

                  Comment

                  • Genomics101
                    Member
                    • May 2012
                    • 60

                    #10
                    Removed all reads less that 200bp and quality under 20 and viola:

                    Comment

                    • Brian Bushnell
                      Super Moderator
                      • Jan 2014
                      • 2709

                      #11
                      Great, looks like the problem is solved!

                      Comment

                      • Genomics101
                        Member
                        • May 2012
                        • 60

                        #12
                        A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know? Do you have any idea what is going on with those little AT-rich losers turning up with legit read names? It must have something to do with the library construction, because I have never had this trouble with AT-rich genomes (this organism or others) before. Thanks again!

                        Comment

                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          This looks like MiSeq data so those reads are likely examples of no insert or a very short insert. Then reads went into the adapter lawn on the flowcell.

                          Comment

                          • Genomics101
                            Member
                            • May 2012
                            • 60

                            #14
                            Thanks very much, GenoMax. Indeed, it is MiSeq data, but I never had this problem with MiSeq before (that was with 250bp PE reads, these are 300s). Can you tell me more the particular pathology with MiSeq? Is this a problem with library construction? And, goodness, what is an adapter lawn?

                            Comment

                            • Brian Bushnell
                              Super Moderator
                              • Jan 2014
                              • 2709

                              #15
                              Originally posted by Genomics101 View Post
                              A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know?
                              The problem seemed to be caused by reads that magically appeared during mapping, and that normally means reads with multiple alignments. And short, extreme-GC reads are much more likely to map in multiple places because of their low information content.

                              Comment

                              Latest Articles

                              Collapse

                              • GATTACAT
                                Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by GATTACAT
                                Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                                07-01-2026, 11:43 AM
                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 11:08 AM
                              0 responses
                              6 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-30-2026, 05:37 AM
                              0 responses
                              11 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-26-2026, 11:10 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              53 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...