SEQanswers

Old 07-22-2017, 03:21 PM   #1
anandksrao
Junior Member
 
Location: Sacramento

Join Date: Jun 2011
Posts: 9
Understanding FastQC output before vs. after trim_galore

I am very new to genome assembly and teaching myself about pre-assembly QC steps.

I ran FastQC on HiSeq 4000 paired-end data (forward and reverse reads), and then performed adapter trimming and base-quality-dependent trimming with "trim_galore", which is a wrapper around FastQC and Cutadapt.

The syntax I used was
Code:
 trim_galore --fastqc --illumina --paired --retain_unpaired EthFoc-2.S282_L007.1.txt EthFoc-2.S282_L007.2.txt
I would like help interpreting some of the FastQC results when comparing the reports from before and after trim_galore.

So you can see what I am referring to, I am attaching nearly all of the plots from the FastQC HTML reports (forward and reverse reads of the pairs, before vs. after trimming): 5 as attached images, 4 as links. My questions are below:

1. Per Base Sequence Quality for forward reads is better than for reverse. In both cases, trimming improves overall quality - correct?
Please see attached image 1

2. The same is true for Per Tile Sequence Quality, though it is a little harder to infer despite the color-based visualization, correct? Also, I am curious whether tile-specific exclusion of Illumina reads ever becomes necessary and, if so, which tools (if any) can perform such filtering / exclusion.
Please see attached image 2
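
For the tile question above, a minimal sketch of how per-tile quality could be tabulated and how reads from chosen tiles could be dropped. It assumes read headers in the Illumina CASAVA 1.8+ layout (`@instrument:run:flowcell:lane:tile:x:y`) and Phred+33 qualities; neither is confirmed for these files. The function names `tile_of`, `mean_quality_by_tile`, and `drop_tiles` are hypothetical and not part of FastQC or trim_galore.

```python
from collections import defaultdict

def tile_of(header):
    """Return the tile field of a read header, or None if the layout differs.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y [optional comment]
    """
    fields = header.split()[0].split(":")
    if len(fields) < 7:
        return None
    return fields[4]

def mean_quality_by_tile(records, offset=33):
    """records: iterable of (header, sequence, quality_string) tuples."""
    totals = defaultdict(lambda: [0, 0])  # tile -> [sum of Phred scores, base count]
    for header, _seq, qual in records:
        tile = tile_of(header)
        totals[tile][0] += sum(ord(c) - offset for c in qual)
        totals[tile][1] += len(qual)
    return {tile: s / n for tile, (s, n) in totals.items() if n}

def drop_tiles(records, bad_tiles):
    """Keep only records whose tile is not listed in bad_tiles."""
    return [r for r in records if tile_of(r[0]) not in bad_tiles]

# Made-up example records (not taken from the data discussed here)
records = [
    ("@K00123:45:HXXXXBBXX:7:1101:1000:2000 1:N:0:ATCACG", "ACGT", "IIII"),
    ("@K00123:45:HXXXXBBXX:7:1102:1500:2500 1:N:0:ATCACG", "ACGT", "####"),
]
print(mean_quality_by_tile(records))
print(len(drop_tiles(records, {"1102"})))
```

With paired-end data, the same read names would have to be removed from both files so that the mates stay in sync.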

3. Per Sequence Quality scores shift to the right on the X-axis (Phred score), as expected from the quality trimming step, yes? At the far right of these graphs, the slope appears less steep after trimming than before. Does this mean that the increase in the number of sequences with improved / sub-maximal per-sequence quality scores will likely improve my overall assembly?
Please see attached image 3
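
To make the per-sequence quality comparison concrete, here is a minimal sketch that bins reads by their mean Phred score, which is the quantity this FastQC plot is built on. It assumes Phred+33 encoding and plain four-line FASTQ records; `read_fastq` and `mean_quality_histogram` are hypothetical helper names.

```python
from collections import Counter

def read_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines, four lines per record."""
    it = iter(lines)
    # zip over the same iterator groups lines in fours and stops at a truncated record
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip(), seq.strip(), qual.strip()

def mean_quality_histogram(lines, offset=33):
    """Count reads per integer-truncated mean Phred score."""
    hist = Counter()
    for _header, _seq, qual in read_fastq(lines):
        if qual:
            mean_q = sum(ord(c) - offset for c in qual) / len(qual)
            hist[int(mean_q)] += 1
    return hist

# Made-up example: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33
example = ["@r1", "ACGT", "+", "IIII",
           "@r2", "ACGT", "+", "##II"]
print(mean_quality_histogram(example))
```

Running this on the untrimmed and trimmed files (for example via `open(path)`) would give two histograms that can be compared bin by bin.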

4. I am most intrigued by Per Base Sequence Content before vs. after trimming, specifically at the position ~ 150nt. Is that abnormal? Also, at positions 1-10nt, are these sequences worth trimming away?
Please see attached image 4
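
A minimal sketch for checking base composition per position alongside the number of reads that reach each position. One possible (unconfirmed) reason for odd behaviour near the read end after trimming is that few reads are still that long, so the percentages there rest on a small sample; the `reads` count makes that visible. `base_content_by_position` is a hypothetical name, and the percentages here are taken over all calls at a position, including N, which may differ from FastQC's own calculation.

```python
from collections import Counter

def base_content_by_position(seqs):
    """Return one dict per position with the read count and % of A, C, G, T."""
    counts = []
    for seq in seqs:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    result = []
    for c in counts:
        total = sum(c.values())  # number of reads long enough to cover this position
        row = {"reads": total}
        for b in "ACGT":
            row[b] = 100.0 * c[b] / total
        result.append(row)
    return result

# Made-up example: the third read is shorter, as after trimming
table = base_content_by_position(["ACGT", "AAGT", "AC"])
for pos, row in enumerate(table, start=1):
    print(pos, row)
```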

5. The Per Sequence GC content is not discernibly different across the graphs in the composite image. For the fungal species being sequenced, overall GC content is commonly ~48-51%. I wonder if I should download Illumina reads for related fungal species from NCBI SRA, generated by other research groups, to check whether this kind of deviation from the theoretical distribution is common. BTW, on the basis of which genome's reads is this theoretical curve plotted?
Please see attached image 5
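
For comparing GC profiles across datasets (for instance, these reads against SRA runs from related species), a minimal sketch of a per-read GC histogram. As far as I understand the FastQC documentation, its theoretical curve is a normal distribution built from the modal GC content of the observed reads rather than from any reference genome, so an independent histogram like this can be a useful cross-check. `gc_percent` and `gc_histogram` are hypothetical names; N calls are excluded from the denominator, which is a choice made here and not necessarily what FastQC does.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of one read, ignoring anything that is not A, C, G or T."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_histogram(seqs):
    """Count reads per rounded GC percentage."""
    hist = Counter()
    for seq in seqs:
        gc = gc_percent(seq)
        if gc is not None:
            hist[round(gc)] += 1
    return hist

# Made-up example reads
print(gc_histogram(["ACGT", "AATT", "GGCC", "NNNN"]))
```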

6. For Per Base N content, there is a minor bump at position 1. Does this mean that my trimming was not performed as well as it should have been?
Image Link pic 6 - http://bit.ly/2tzZhgs

7. Because of the adapter and quality trimming, I am thinking changes in the Sequence Length Distribution are as expected. Would you agree?
Image Link pic 7 - http://bit.ly/2tq5sjl

8. For the Sequence Duplication Level graphs, I am not sure I understand the difference between the red and blue lines in the sub-panels. Interestingly, the only bump is at the ~>10X duplication level, not at lower or higher levels. Is this species-specific? And I wonder if I should compare this to SRA reads for identical or similar species, sequenced by other research groups. Thoughts?
Image Link pic 8 - http://bit.ly/2uMdaHZ
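
On the red vs. blue lines: per the FastQC documentation, the blue line shows how the full read set is distributed across duplication levels, while the red line shows the same for the de-duplicated set (each distinct sequence counted once). A minimal sketch of that distinction follows. It is a simplification: it counts exact duplicates over all reads, whereas FastQC samples and truncates reads before counting. `duplication_levels` is a hypothetical name.

```python
from collections import Counter

def duplication_levels(seqs):
    """Return (% of all reads, % of distinct sequences) keyed by duplication level."""
    copies = Counter(seqs)            # distinct sequence -> number of exact copies
    n_total = sum(copies.values())    # all reads
    n_distinct = len(copies)          # reads left after de-duplication
    pct_of_total = Counter()          # analogous to the blue line
    pct_of_distinct = Counter()       # analogous to the red line
    for count in copies.values():
        pct_of_total[count] += 100.0 * count / n_total
        pct_of_distinct[count] += 100.0 / n_distinct
    return dict(pct_of_total), dict(pct_of_distinct)

# Made-up example: one sequence seen three times, one seen once
blue, red = duplication_levels(["AAAA", "AAAA", "AAAA", "CCCC"])
print(blue)
print(red)
```

In this toy case the triplicated sequence accounts for 75% of all reads but only 50% of the distinct sequences, which is why the two lines can diverge at high duplication levels.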

9. Adapter content is what started all this: FastQC reported Illumina Universal Adapter content at multiple positions in the original reads, increasing all the way to the read end, so I decided to run this trim_galore / cutadapt step. It seems entirely normal that the adapter content would go away after this step. Correct?
Image Link pic 9 - http://bit.ly/2eEsaSi
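
A minimal sketch of a cumulative adapter-content curve of the kind FastQC plots, for checking the before/after difference by hand. The adapter string is an assumption: it is the 13 bp sequence I understand trim_galore to use for `--illumina`, and it should be verified against the installed version. Only exact, full-length matches are counted, so partial adapters at the very end of a read are missed. `cumulative_adapter_percent` is a hypothetical name.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed Illumina universal adapter prefix; verify before use

def cumulative_adapter_percent(seqs, adapter=ADAPTER):
    """For each position, % of reads whose first exact adapter match starts at or before it."""
    seqs = list(seqs)
    if not seqs:
        return []
    max_len = max(len(s) for s in seqs)
    first_hit = [0] * max_len
    for s in seqs:
        pos = s.find(adapter)
        if pos != -1:
            first_hit[pos] += 1
    curve = []
    running = 0
    for n in first_hit:
        running += n
        curve.append(100.0 * running / len(seqs))
    return curve

# Made-up example: two of four reads contain the adapter
reads = ["ACGT" + ADAPTER, "A" * 17, ADAPTER + "ACGT", "C" * 17]
print(cumulative_adapter_percent(reads))
```

After successful trimming, the same function run on the trimmed reads should stay at or near zero across all positions.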

THANK YOU!
Old 07-24-2017, 05:20 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,585

Cross-posted and answered at Biostars: https://www.biostars.org/p/264114

Tags
fastqc, illumina, k-mer, masking, trimming

All times are GMT -8.