  • [FASTQC] Biases in GC whole sequence content

    Hey all, I am analyzing several RNA-Seq datasets and have noticed a somewhat odd pattern. In all of the samples I've QC'd so far, there is a bias or "shoulder" towards reads with a low GC content. I am wondering what causes this, whether it's a problem, and if so, what I should do about it.

    I am working with 3 different RNA-Seq datasets from two different organisms: Drosophila melanogaster and Aedes aegypti. These datasets were produced by 3 different research groups (including my own) and 3 different sequencing companies, so I doubt it's an error in sample prep or sequencing. The sequencing was quite deep. The only common factor I can find is that all 3 groups used Illumina TruSeq kits for library construction. I know for a fact that this bias is not caused by the "random" hexamer priming issue or by low sequence quality: slicing the 5' end off and filtering out low quality reads has no effect on the GC bias "shoulder".

    Just curious what causes this phenomenon. I'm attaching one of the more extreme examples from before and after basic QC. My guess is that this bias won't really affect differential expression calling (since it's the same for all samples), but it's still weirding me out a bit.

  • #2
    Have you BLASTed the libraries to look for contamination?

    Also, could that shoulder simply be poly-A sequences? You might try trimming poly-A tails and then rerunning FastQC.
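
    As a rough illustration of the poly-A trimming suggested above, here is a minimal sketch in Python. The function name `trim_poly_a` and the 10-base minimum run length are assumptions made for this example, not settings from any particular trimming tool, and a dedicated read trimmer would normally be the better choice for real data.

```python
import re


def trim_poly_a(seq: str, min_run: int = 10) -> str:
    """Strip a trailing poly-A run (or a leading poly-T run, as seen when the
    read comes from the opposite strand) of at least `min_run` bases.

    Hypothetical helper for illustration only; `min_run` is an assumed cutoff.
    """
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)   # 3' poly-A tail
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)   # 5' poly-T head
    return seq


# Example: a read ending in a 12-base poly-A tail.
trimmed = trim_poly_a("ACGTACGT" + "A" * 12)
```

    After trimming, rerunning FastQC on the trimmed reads would show whether the low-GC shoulder shrinks.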

    • #3
      Although there was some TruSeq adapter contamination in this sample initially (2% of reads), dumping the low quality reads and trimming off the 5' ends got rid of it. There weren't any overrepresented sequences or enriched k-mers whatsoever (like poly-A sequences) after QC. I'm not sure what else I would be BLASTing in the libraries.

      However, after doing a bit of brainstorming, I might have a hypothesis for what the shoulder is. A lot of insect species (including D. melanogaster and A. aegypti) can be infected by a bacterium called Wolbachia (especially common in laboratory stocks). I checked the GC content for the A. aegypti transcriptome (which is the sample I posted here) and it's about 50%, which corresponds to the main peak of the graphs I posted. The GC content of the Wolbachia genome is ~35%, which would match the second peak/shoulder. If this is the case, I'd expect to find a bunch of Wolbachia-specific genes when I assemble the transcriptome. I could potentially mask out the Wolbachia contamination later when I start performing expression counting.

      (But I'm not quite to that point in my analysis yet, so I'll let you know what happens and post back here when I do. I'm going to be pretty amused if it turns out all 3 laboratories have a massive Wolbachia problem...)
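
      To check where the two peaks sit outside of FastQC, the per-read GC distribution can be recomputed directly. The following is a minimal sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines); the function names and the 5% bin width are choices made for this example and are not how FastQC itself computes the plot.

```python
from collections import Counter
from typing import Iterable, Iterator


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def gc_percent(seq: str) -> float:
    """GC percentage over called bases (A, C, G, T); N and others are ignored."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(seqs: Iterable[str], bin_width: int = 5) -> Counter:
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist = Counter()
    for seq in seqs:
        lower_edge = int(gc_percent(seq) // bin_width) * bin_width
        hist[min(lower_edge, 100 - bin_width)] += 1  # put 100% in the top bin
    return hist


# Usage with a real file (the path is a placeholder):
# with open("sample.fastq") as handle:
#     hist = gc_histogram(fastq_sequences(handle))
#     for lower_edge in sorted(hist):
#         print(lower_edge, hist[lower_edge])
```

      If the hypothesis holds, the read counts should pile up in the bins around 50% (host transcriptome) with a second, smaller pile-up around 35%.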

      • #4
        Right, BLAST would potentially let you know if it was bacterial, and if so what species, so you can better filter it.

        • #5
          Hi Kazi1,
          It seems like you are presuming by default that a transcriptome GC% profile should not be bimodal. Why would you presume that? I mean, other than FastQC giving you a big red "X" next to "Per sequence GC content".

          --
          Phillip

          • #6
            It's true, I've made the assumption that it shouldn't be bimodal simply on the basis of the "big red X" in FastQC. I haven't done that much bioinformatics work before, so I've been working through and trying to figure out what each of the QC flags means. I get 3 red flags from FastQC right now: "per base sequence content" (from the random hexamer priming), "sequence duplication" (from the high level of coverage), and the "per sequence GC content". The "per sequence GC content" is the only one I can't explain.

            I know that FastQC is optimized for genomic DNA reads, so perhaps it's just sending up that flag unnecessarily when dealing with RNA-Seq data? It'd be great if that's just the way transcriptomic data looks normally. I just wanted some second opinions (from people with more experience with FastQC/RNA-Seq).

            • #7
              A big red "X" in FastQC is not an immediate indication that a step has failed completely. Since you expect to see an unrelated species (Wolbachia) coexisting in the sample, a strange GC distribution would be acceptable for your data.

              • #8
                Even in genomic DNA libraries I occasionally see bimodal (or trimodal) distributions of GC% in that plot. Although contamination (or infestation) of the sample with another species is possible, I see no reason to presume it is the case.

                Still, there is no harm in pulling out a few thousand representative reads from the two peaks and BLASTing them to see if you get lots of best hits to a different phylum or kingdom. But it could be a waste of time and might even lead you to throw out data that actually should be kept.
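
                Pulling reads from each peak could look something like the sketch below. It assumes the reads are already available as (name, sequence) pairs; the function names, the GC windows in the example, and the fixed random seed are all assumptions for illustration. The output is FASTA text that could be saved and used as a BLAST query.

```python
import random
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def gc_fraction(seq: str) -> float:
    """GC fraction (0.0 to 1.0) over called bases; N and others are ignored."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def sample_reads_in_gc_window(reads: Iterable[Read], low: float, high: float,
                              n: int, seed: int = 0) -> List[Read]:
    """Return up to `n` randomly chosen reads with low <= GC fraction < high."""
    in_window = [(name, seq) for name, seq in reads
                 if low <= gc_fraction(seq) < high]
    if len(in_window) <= n:
        return in_window
    return random.Random(seed).sample(in_window, n)


def to_fasta(reads: Iterable[Read]) -> str:
    """Format reads as FASTA text, one sequence line per record."""
    return "".join(f">{name}\n{seq}\n" for name, seq in reads)


# Toy example; the window 0.20-0.40 stands in for the low-GC shoulder.
toy_reads = [("r1", "ATATATGC"), ("r2", "GCGCATAT"),
             ("r3", "ATATATAT"), ("r4", "GGCCGGCC")]
shoulder = sample_reads_in_gc_window(toy_reads, 0.20, 0.40, n=1000)
shoulder_fasta = to_fasta(shoulder)
```

                Running the same selection with a window around the main peak gives a second set to compare against, so the best hits for the two peaks can be judged side by side.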

                The big red "X" issue is one that plagues us occasionally. You just need to take it in stride. It is just a program. You don't want to turn off your brain when using it.
                --
                Phillip

                • #9
                  Ok good to keep in mind! Thanks to all for your advice!

                  • #10
                    Originally posted by kazi1
                    I'm going to be pretty amused if it turns out all 3 laboratories have a massive Wolbachia problem...)
                    A lot of insects have Wolbachia DNA integrated into their chromosomes, so it might not be contamination or infection.
