Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bizarre per sequence GC content distribution, what could cause this?

    Greetings. I am working with an Illumina library of whole genome sequence for a very AT-rich genome. When I initially did a FASTQC I saw this:

    The bimodal distribution is indicative of contamination. I mapped the reads to a reference and looking at the unmapped reads, i see that they are indeed the source of the smaller bump. However, when I look at the MAPPED reads, I get this even stranger distribution (both actual and theoretical):

    Normally, after the mapping the distribution with this organism looks like this:

    The FASTQC analyses doesn't reveal any overrepresented sequences, and the library has only a 9.11% duplication level. What could possibly be going on?

  • #2
    That contradicts your initial plot - for one thing, there are more total reads after mapping.

    Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment? Try using only uniquely-mapped reads, or at least, only primary alignments.

    Comment


    • #3
      Hi, Brian, thanks for your response.

      That contradicts your initial plot - for one thing, there are more total reads after mapping.
      This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.

      Perhaps you have a few poly-AT reads and a few poly-AT regions in the genome, and you are generating a huge amount of secondary alignments during mapping because those reads can map many times in the same small segment?
      This could easily be the case for the mapping, however I am not clear how this would artificially inflate the number of reads.

      Try using only uniquely-mapped reads
      I would be keen to do this, how is this done?

      or at least, only primary alignments.
      I'm not sure what this is.

      Thanks again.

      Comment


      • #4
        Originally posted by Genomics101 View Post
        This is not the case, the number or reads went from 1804074 total to 1801608 after filtering for unmapped reads. The maximum number on the Y-axis is higher after mapping because of the large population of high AT-content sequences.
        Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent. The primary peak centered on 20% GC appeared to stay the same shape and same height.

        You can use samtools on the sam or bam file from the mapped reads to filter only primary alignments, or change the parameters of the mapping program to only allow one alignment per read. With BBMap, for example, you would do so like this:

        bbmap.sh ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq

        ...because its default behavior is to only output one read per input read. And this command will keep the output in fastq format.

        Comment


        • #5
          Originally posted by Brian Bushnell View Post
          Before mapping, there appear to be ~zero reads with 0% GC. After mapping, there are over 100,000. Those numbers are inconsistent.
          This is true and and I puzzled by it as well. Although the number of reads went down after I extracted only the mapped reads (or so I thought), not enough. I'm going to make sure I am doing things properly AND re-run the analysis and see where it goes. Thanks for the sharp eyes.

          EDIT: Looks like I at least did the analysis correctly, and I am now adding the additional analyses you suggested.
          Last edited by Genomics101; 08-01-2014, 02:33 PM.

          Comment


          • #6
            The mystery continues. I did all the analyses again with exactly the same results. Filtering for the primary alignments however gives me MORE reads in total than i have with just the original FASTQ (now it's up to 1970667 from the original 1804074).

            For what it's worth, this is the GC content:


            Where are these extra reads (or what ever is getting interpreted as such) coming from?
            Last edited by Genomics101; 08-01-2014, 03:20 PM.

            Comment


            • #7
              If you are aligning with BWA, it now generates "chimeric" alignments that you can't get rid of just by filtering for primary; you also have to exclude reads with flag "0x800". I suggest you look for a couple of the reads with names that appear more than twice in the sam file and if possible post them here (the entire lines).

              Comment


              • #8
                Hi, Brian. I'm doing that now. In the mean time, here is how it looks with reads that only have unique positions:

                Although I didn't trim the reads I noticed that there are some strange, rogue small reads (20-50 bp) in there and I have no idea how that happened. I have a feeling those are the ones that are responsible for the strange distribution. Does anyone know how to remove them?
                Last edited by Genomics101; 08-01-2014, 07:06 PM.

                Comment


                • #9
                  Yes I am using BWA (MEM).

                  Here is a read whose name shows up 8 times:

                  Code:
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      65      Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2113    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2129    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      129     Pf3D7_14_v3     2584936 49      63M1D87M150S    =       2584936 0       GGCCATTTTAAATTTAACTACAATAGAAAGAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAGAAAAAAAGGAAAAATTAAAAAAAAAAAAAAAAATATAAAAAATATATTAATATATATAAATGTGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAATATTAAATAAAAAAAAAATAAATATATATAACATAGTAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA    CCCCC@<E<FGGC9<FGGGGGGGGGGGGGGGGGGGGGGGFCCFFFGGGGGGGGGGGGGGGGG++?EFGGGG>EF,FEF,4CE,CF++@FG+C7F**1,@3,@FFGG,D33@,<@,8@;,3<@F,@@;,8<@C<@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?C*:*32+2+0++<CFG5*1*/*+0<99+0+0+3A+++++3<0<++9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@*    NM:i:11 AS:i:93 XS:i:50 SA:Z:Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_08_v3     376972  0       221H79M Pf3D7_14_v3     2584936 0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAATATAAAAAAAATAAATAATAAATAAAAA +9AEGGGGGGGGGGGEGG>8CEGDGDC*85?8***+0++<*2::EC8***/*3*+2<;+:5)*?*2*+******0:C@* NM:i:6  AS:i:52 XS:i:51 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_07_v3,421764,+,132S45M123S,0,0;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2177    Pf3D7_07_v3     421764  0       132H45M123H     Pf3D7_14_v3     2584936 0       TAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAA   <@FGGGGGGGGGGGGGGC8*:CEGEGC5**2;FD*/8?8?E8*:?   NM:i:0  AS:i:45 XS:i:44 SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_09_v3,566928,-,104S35M161S,49,0;
                  M01243:91:000000000-A6ELJ:1:2119:9911:4339      2193    Pf3D7_09_v3     566928  49      104H35M161H     Pf3D7_14_v3     2584936 0       TTTTTTTTATTTAATATTTTTTTTTTTTTTTTTTT     1*5GFC<++0+2+23*:*C?:*8E?8?8/*DF;2*     NM:i:0  AS:i:35 XS:i:0  SA:Z:Pf3D7_14_v3,2584936,+,63M1D87M150S,49,11;Pf3D7_08_v3,376972,+,221S79M,0,6;Pf3D7_07_v3,421764,+,132S45M123S,0,0;
                  And I'll be darned, look at those last short reads, all AT! What is going on here I wonder?

                  Comment


                  • #10
                    Removed all reads less that 200bp and quality under 20 and viola:

                    Comment


                    • #11
                      Great, looks like the problem is solved!

                      Comment


                      • #12
                        A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know? Do you have any idea what is going on with those little AT-rich losers turning up with legit read names? It must have something to do with the library construction, because I have never had this trouble with AT-rich genomes (this organism or others) before. Thanks again!

                        Comment


                        • #13
                          This looks like MiSeq data so those reads are likely examples of no insert or a very short insert. Then reads went into the adapter lawn on the flowcell.

                          Comment


                          • #14
                            Thanks very much, GenoMax. Indeed, it is MiSeq data, but I never had this problem with MiSeq before (that was with 250bp PE reads, these are 300s). Can you tell me more the particular pathology with MiSeq? Is this a problem with library construction? And, goodness, what is an adapter lawn?

                            Comment


                            • #15
                              Originally posted by Genomics101 View Post
                              A million thanks Brian, I have been puzzling over this for days and with your help it was unraveled in just some hours. The key was looking for the reads with >2 placements, how did you know?
                              The problem seemed to be caused by reads that magically appeared during mapping, and that normally means reads with multiple alignments. And short, extreme-GC reads are much more likely to map in multiple places because of their low information content.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X