  • Shotgun metagenomics of environmental samples: Per Base Sequence Content and Per Sequence GC Content failed after trimming

    Dear all,

    I am new to analyzing shotgun metagenomics data and ran into some issues while checking the quality of my data. I am posting my concerns here in the hope that someone can help.

    DNA samples: Genomic DNA isolated from environmental samples (soil, sewage, or freshwater). We are interested in the community structures of bacteria and archaea in those samples as well as detecting functional genes.

    Sequencing platform: Illumina, Shallow Metagenomics, Shotgun sequencing of DNA, Paired-end sequencing

    Library: Nextera kits (I got this information when running TrimGalore!)

    Concern-1: Per Base Sequence Content
    Before trimming, I checked the quality of the raw data using FastQC + MultiQC. Many samples failed the Per Base Sequence Content test with biased composition at the 5'-end (see the attached Per Base Sequence Content-No trimming.jpg), and all samples failed the Adapter Content test (see the attached Adapter Content--No trimming.jpg). I therefore decided to trim the 5'-end by removing 15 bp from each read and to trim the adapters. I trimmed all the raw reads with TrimGalore! using the following command:
    ===============
    ~/TrimGalore-0.6.5/trim_galore --clip_R1 15 --clip_R2 15 --paired read_1_sample_1.fastq.gz read_2_sample_1.fastq.gz read_1_sample_2.fastq.gz read_2_sample_2.fastq.gz … read_1_sample_N.fastq.gz read_2_sample_N.fastq.gz
    ===============
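With many samples on one command line it is easy to mismatch mates. The sketch below builds the argument list so that each R1 file is followed by its own R2 file. It assumes the `read_<mate>_sample_<name>.fastq.gz` naming used in the command above and uses only the flags shown there; the helper name `build_trim_galore_args` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: files are named read_<1|2>_sample_<name>.fastq.gz, as in the post.
PATTERN = re.compile(r"read_([12])_sample_(.+)\.fastq\.gz$")


def build_trim_galore_args(filenames, clip=15):
    """Return an argv list with files ordered as R1, R2 for each sample."""
    mates = {}
    for name in filenames:
        match = PATTERN.search(Path(name).name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        mate, sample = match.groups()
        if mate in mates.setdefault(sample, {}):
            raise ValueError(f"duplicate mate {mate} for sample {sample}")
        mates[sample][mate] = name

    # Only the flags that appear in the original command are used here.
    args = ["trim_galore", "--clip_R1", str(clip), "--clip_R2", str(clip), "--paired"]
    for sample in sorted(mates):
        pair = mates[sample]
        if set(pair) != {"1", "2"}:
            raise ValueError(f"sample {sample} is missing a mate")
        args += [pair["1"], pair["2"]]
    return args


if __name__ == "__main__":
    files = [
        "read_1_sample_1.fastq.gz", "read_2_sample_1.fastq.gz",
        "read_1_sample_2.fastq.gz", "read_2_sample_2.fastq.gz",
    ]
    print(" ".join(build_trim_galore_args(files)))
```

The returned list could be passed to `subprocess.run` once the path to `trim_galore` is adjusted for the local installation.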
    After trimming, I ran FastQC + MultiQC again and found that, surprisingly, all samples now failed the Per Base Sequence Content test. All samples share the same pattern: the 3'-end is strongly biased, with very low C content (see the attached Per Base Sequence Content-After trimming.jpg).
    My question is: should I worry about the bias at the 3'-end, or should I trim the 3'-end further? Specifically, the line for C is roughly horizontal before trimming. Why did it drop to almost zero after trimming? An online discussion (https://github.com/FelixKrueger/Trim...-auto-detectio) mentioned that [Note that the sharp decrease of A at the last position is a result of removing the adapter sequence very stringently, i.e. even a single trailing A at the end is removed.] However, as far as I understand, trimming at the 3'-end only removes adapter sequence (when there is sequencing read-through). It should not affect the sequence that is kept. If the line for C is horizontal before trimming, it should also be horizontal after trimming. I am a bit confused.

    Concern-2: Per Sequence GC Content
    Before trimming, I found that many samples failed the Per Sequence GC Content test because of the multiple peaks in the plot (see the attached Per Sequence GC Content--No trimming.jpg). I thought that this failure was due to adapter contamination. However, after trimming, many samples still have the issue (see the attached Per Sequence GC Content--After trimming.jpg).

    My question is: why do my samples show multiple peaks? Is it possible that my samples contain more than one dominant species, or are the multiple peaks due to sequencing or processing errors? How should I fix this issue?

    Question-3: My sequencing is shallow. Also, my samples are not pure cultures; they contain millions of different species of microbes. We will examine the microbial community structure and look for functional genes. In this case, should I assemble the reads before the downstream analysis? I read some online discussions: some suggest assembly, and some say that it is better to skip it. I am new to this area and do not know which (with vs. without assembly) is the better choice.

    Thanks for reading this posting!

  • #2
    Rule #1: Do not get hung up on the big red X's in FastQC.

    The thresholds which delineate Pass|Warn|Fail for the various metrics in FastQC were set using beautiful, single species, perfectly random and uniform genomic DNA libraries. Things that deviate from this in terms of sampling method, library content and library construction produce false failures. It is likely that the data is perfectly good for your organism(s), given that you are performing a metagenomic experiment with widely variable samples.

    You stated that you made these libraries using a Nextera kit. The tagmentation in Nextera library kits is not perfectly random; there is a sequence composition bias at the tagmentation site. Your original (untrimmed) Per Base Sequence Content plot is perfectly normal for Nextera libraries; the bias at the 5' end simply reflects the bias of the tagmentation enzyme. There is no need to trim the 5' end, but you can go ahead if you want to.

    I have seen the highly skewed 3' end in the Per Base Sequence Content plot before with trimmed reads. I'm not sure if it is an artifact of trimming or of the grouping algorithm in FastQC when it doesn't have enough bases left to include in its default group size of 5bp. (This is purely speculation.)
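One way to probe the trimming explanation is a toy simulation. The sketch below is not Trim Galore itself; it is a simplified exact-match trimmer that rests on two assumptions: that the Nextera adapter sequence used for trimming starts with C (`CTGTCTCTTATA`), and that a 1 bp overlap with the adapter is enough to trigger trimming. Under those assumptions, any read that happens to end in C loses that base, so the reads that remain full length never end in C.

```python
import random

# Assumption: adapter sequence used for Nextera trimming; note that it starts with C.
ADAPTER = "CTGTCTCTTATA"


def trim_3prime(read, adapter=ADAPTER, min_overlap=1):
    """Simplified 3' adapter trimming: exact matches only, no quality trimming."""
    hit = read.find(adapter)
    if hit != -1:
        return read[:hit]  # full adapter inside the read
    # Otherwise look for the longest read suffix that equals an adapter prefix.
    for k in range(min(len(read), len(adapter) - 1), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def base_fraction_by_position(reads, base):
    """Fraction of `base` at each position, over the reads that reach that position."""
    longest = max(len(r) for r in reads)
    fractions = []
    for pos in range(longest):
        column = [r[pos] for r in reads if len(r) > pos]
        fractions.append(column.count(base) / len(column))
    return fractions


def simulate(n_reads=20000, read_len=50, seed=0):
    """Compare C content per position before and after trimming random reads."""
    rng = random.Random(seed)
    reads = ["".join(rng.choices("ACGT", k=read_len)) for _ in range(n_reads)]
    trimmed = [t for t in (trim_3prime(r) for r in reads) if t]
    return base_fraction_by_position(reads, "C"), base_fraction_by_position(trimmed, "C")


if __name__ == "__main__":
    before, after = simulate()
    print("C at last position before trimming:", round(before[-1], 3))
    print("C at last position after trimming: ", round(after[-1], 3))
```

In this toy model the C line is flat before trimming and falls to zero at the final position afterwards, even though no real adapter was present. That is consistent with the trimming explanation, but it does not rule out a contribution from FastQC's position grouping.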

    Regarding the GC content plots, you are sampling a large diversity of bacteria from a variety of very distinct environments. It is totally expected that the bacterial populations in your different environments would have widely variable GC content distributions. This has nothing to do with adapters. Again, the failure is due to FastQC's expectations not matching the reality of the experiment you are performing.
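To illustrate why a mixed community gives a multi-peaked Per Sequence GC Content plot, here is a minimal sketch that pools simulated reads from two hypothetical organisms. The GC values (35% and 65%), read length, and read counts are made-up illustration values, not estimates for these samples.

```python
import random


def simulate_reads(gc, n_reads, read_len, rng):
    """Draw random reads whose expected GC fraction is `gc`."""
    reads = []
    for _ in range(n_reads):
        bases = [
            rng.choice("GC") if rng.random() < gc else rng.choice("AT")
            for _ in range(read_len)
        ]
        reads.append("".join(bases))
    return reads


def gc_histogram(reads, n_bins=20):
    """Count reads per GC bin (bin width 1/n_bins), using integer arithmetic."""
    hist = [0] * n_bins
    for read in reads:
        gc_count = read.count("G") + read.count("C")
        index = min(gc_count * n_bins // len(read), n_bins - 1)
        hist[index] += 1
    return hist


def mixed_community_histogram(seed=1):
    """Pool reads from a low-GC and a high-GC genome (hypothetical values)."""
    rng = random.Random(seed)
    reads = simulate_reads(0.35, 2000, 100, rng) + simulate_reads(0.65, 2000, 100, rng)
    return gc_histogram(reads)


if __name__ == "__main__":
    for index, count in enumerate(mixed_community_histogram()):
        print(f"{index * 5:3d}-{index * 5 + 5:3d}% GC  {'#' * (count // 20)}")
```

The pooled histogram has one peak per simulated genome with a trough in between, which is the shape FastQC flags because it expects a single roughly normal distribution.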

    The Adapter content plot is the only one which really shows something you need to address. It is normal (especially for libraries prepared using Nextera kits) to have some fragments shorter than your read length (150bp in your case). Your particular libraries vary from ~20% to 35% in the percentage of fragments < 150bp. Performing 3' adapter trimming is required to remove adapter sequences from these reads.
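The read-through described above can be sketched as follows. The adapter string is assumed to be the Nextera transposase sequence, and the function names and insert lengths are hypothetical; the point is only that the adapter appears in a read exactly when the insert is shorter than the read length.

```python
# Assumption: Nextera transposase adapter sequence read after the end of the insert.
ADAPTER = "CTGTCTCTTATACACATCT"


def sequence_fragment(insert, read_len, adapter=ADAPTER):
    """Return the bases read from a fragment: the insert, then the adapter.

    Simplification: real reads are always read_len long and continue past the
    adapter; here the read simply stops where the adapter ends.
    """
    return (insert + adapter)[:read_len]


def adapter_position(read, adapter=ADAPTER, seed_len=12):
    """Position where the adapter starts in the read, or -1 if it is not seen.

    Only the first seed_len adapter bases are searched for, so very short
    partial adapters at the end of a read are missed by this sketch.
    """
    return read.find(adapter[:seed_len])


def fraction_with_readthrough(insert_lengths, read_len=150):
    """Share of fragments shorter than the read length."""
    short = sum(1 for length in insert_lengths if length < read_len)
    return short / len(insert_lengths)


if __name__ == "__main__":
    short_insert = "ACGT" * 25   # 100 bp, shorter than the read length
    long_insert = "ACGT" * 50    # 200 bp, longer than the read length
    print(adapter_position(sequence_fragment(short_insert, 150)))
    print(adapter_position(sequence_fragment(long_insert, 150)))
```

Everything from the reported adapter position onward is what 3' adapter trimming removes.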
    Last edited by kmcarr; 03-10-2020, 11:19 AM. Reason: Correct 5'/3' mixup



    • #3
      Dear kmcarr,

      Thanks a lot for the reply and explaining the details. Appreciate that!

      After reading your response, I understand that the adapter contamination is the only thing that I need to worry about. I have used TrimGalore! to remove the adapters from the 3'-end of the raw reads. However, you also suggested that "Performing 5' adapter trimming is required to remove adapter sequences from these reads." I am a bit confused. Based on my current understanding (maybe I am wrong), in my case, I only have adapters at the 3'-end of the reads. Do we have adapters at both ends (3'- and 5'-)?

      Thanks again!



      • #4
        Originally posted by yy273826987
        Dear kmcarr,

        Thanks a lot for the reply and explaining the details. Appreciate that!

        After reading your response, I understand that the adapter contamination is the only thing that I need to worry about. I have used TrimGalore! to remove the adapters from the 3'-end of the raw reads. However, you also suggested that "Performing 5' adapter trimming is required to remove adapter sequences from these reads." I am a bit confused. Based on my current understanding (maybe I am wrong), in my case, I only have adapters at the 3'-end of the reads. Do we have adapters at both ends (3'- and 5'-)?

        Thanks again!
        Sorry, that was an error. I meant to type "Performing 3' adapter trimming...."

        I have edited my original post to fix this.



        • #5
          Dear kmcarr,

          Thanks for the quick response and the clarification.

          May I ask a few more questions? For my specific case, should I perform assembly before downstream analysis?

          Also, after quality control, which software or pipeline would you suggest I begin with (for assembly, annotation, taxonomic analysis, and finding functional genes)? There are numerous tools and pipelines, and as a real newbie I am having a hard time deciding which one to start with.

          Thanks!



          • #6
            Originally posted by yy273826987
            Dear kmcarr,

            Thanks for the quick response and the clarification.

            May I ask a few more questions? For my specific case, should I perform assembly before downstream analysis?

            Also, after quality control, which software or pipeline would you suggest I begin with (for assembly, annotation, taxonomic analysis, and finding functional genes)? There are numerous tools and pipelines, and as a real newbie I am having a hard time deciding which one to start with.

            Thanks!
            yy2,

            The downstream analysis part is a bit outside my area so I'll have to leave that to others to help you.

            Cheers.
