  • Understanding FastQC output before vs. after trim_galore

    I am very new to genome assembly and teaching myself about pre-assembly QC steps.

    I ran FastQC on HiSeq 4000 paired-end data (forward and reverse reads), then performed adapter trimming and base-quality trimming with "trim_galore", which is a wrapper around 'FastQC' and 'cutadapt'.

    The command I used was:
    Code:
     trim_galore --fastqc --illumina --paired --retain_unpaired EthFoc-2.S282_L007.1.txt EthFoc-2.S282_L007.2.txt
    I seek your help in understanding and interpreting some of the FastQC results when comparing reads before and after trim_galore (forward and reverse reads of the paired ends in each case).

    So you can see what I am referring to, I'm including nearly all of the pics from the FastQC html reports: 5 attached here, 4 as links. The help I seek is in the form of answers to my questions below:

    1. Per Base Sequence Quality for forward reads is better than for reverse. In both cases, trimming improves overall quality - correct?
    Please see attached image 1

    2. The same is true for Per Tile Sequence Quality, though it is a little harder to infer despite the color-based visualization, correct? Also, I am curious whether tile-specific exclusion of Illumina reads ever becomes necessary, and if so, which tools (if any are available) can perform such filtering / exclusion.
    Please see attached image 2

    3. Per Sequence Quality scores shift to the right along the X-axis (Phred score), as expected from the quality trimming step, yes? At the far right of these graphs, the slope appears less steep after trimming than before trimming. Does this mean that the increase in the number of sequences with improved / sub-maximal per sequence quality scores will likely improve my overall assembly?
    Please see attached image 3

    4. I am most intrigued by Per Base Sequence Content before vs. after trimming, specifically at the position ~ 150nt. Is that abnormal? Also, at positions 1-10nt, are these sequences worth trimming away?
    Please see attached image 4

    5. The Per Sequence GC content is not discernibly different across the graphs in the composite image. For the fungal species being sequenced, overall GC content is commonly ~48-51%. I wonder if I should download Illumina files from NCBI SRA for related fungal species, generated by other research groups, to check whether this deviation from the theoretical distribution is common. BTW, on the basis of which genome's reads is this theoretical curve plotted?
    Please see attached image 5

    6. For Per Base N content, there is a minor bump at position 1. Does this mean that my trimming was not performed as well as it should have been?
    Image Link pic 6 - http://bit.ly/2tzZhgs

    7. Because of the adapter and quality trimming, I am thinking changes in the Sequence Length Distribution are as expected. Would you agree?
    Image Link pic 7 - http://bit.ly/2tq5sjl

    8. For the Sequence Duplication Level graphs, I am not sure I understand the difference between the red and blue lines in the sub-panels. Interestingly, the only bump is at the >10X duplication level, not at lower or higher levels. Is this species-specific? And I wonder if I should compare this to SRA reads for identical or similar species, sequenced by other research groups. Thoughts?
    Image Link pic 8 - http://bit.ly/2uMdaHZ

    9. Adapter content is what started it all: I saw FastQC report Illumina Universal Adapter content at multiple positions in the original reads, increasing all the way up to the read end, so I decided to run this trim_galore / cutadapt step. It seems totally normal that the adapter content would go away after this step. Correct?
    Image Link pic 9 - http://bit.ly/2eEsaSi
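
    For anyone who wants to cross-check points 3 and 5 outside of FastQC, below is a minimal sketch that tallies per-read GC content and per-read mean Phred score from a FASTQ file. It is not how FastQC itself builds its plots. It assumes plain 4-line FASTQ records and a Phred+33 quality offset, and the function names and the file names in the usage note are hypothetical placeholders.

```python
import gzip
from collections import Counter


def read_fastq(path):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:  # end of file (or a blank line) stops iteration
                break
            seq = handle.readline().rstrip()
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip()
            yield header, seq, qual


def gc_percent(seq):
    """Percentage of G and C bases in one read (0.0 for an empty read)."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)


def mean_phred(qual, offset=33):
    """Mean Phred score of one read; offset=33 is an assumption (Phred+33)."""
    if not qual:
        return 0.0
    return sum(ord(char) - offset for char in qual) / len(qual)


def summarize(path):
    """Return two histograms: reads per GC percent and reads per mean Phred."""
    gc_hist = Counter()
    qual_hist = Counter()
    for _, seq, qual in read_fastq(path):
        gc_hist[round(gc_percent(seq))] += 1
        qual_hist[int(mean_phred(qual))] += 1
    return gc_hist, qual_hist
```

    Calling `summarize("reads_before.fastq")` and `summarize("reads_after.fastq")` (placeholder names) gives two pairs of histograms that can be compared side by side, or against reads downloaded from SRA for a related species.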

    THANK YOU!

  • #2
    Cross-posted and answered at Biostars: https://www.biostars.org/p/264114
