SEQanswers

Old 03-08-2020, 12:29 PM   #1
yy273826987
Junior Member
 
Location: Cincinnati, Ohio, USA

Join Date: Mar 2020
Posts: 6
Shotgun Meta of Environ Sam: Per Base Seq Cont Per Seq GC Cont failed aft trimming

Dear all,

I am new to analyzing shotgun metagenomics data, and I ran into some issues when checking the quality of my data. I am posting my concerns here in the hope that someone can help.

DNA samples: Genomic DNA isolated from environmental samples (soil, sewage, or freshwater). We are interested in the community structures of bacteria and archaea in those samples as well as detecting functional genes.

Sequencing platform: Illumina, Shallow Metagenomics, Shotgun sequencing of DNA, Paired-end sequencing

Library: Nextera kits (I got this information when running TrimGalore!)

Concern-1: Per Base Sequence Content
Before trimming, I checked the quality of the raw data using FastQC + MultiQC. Many samples failed the Per Base Sequence Content test with biased composition at the 5' end (see the attached Per Base Sequence Content-No trimming.jpg), and all samples failed the Adapter Content test (see the attached Adapter Content--No trimming.jpg). I therefore decided to trim the 5' end by removing 15 bp from each read and also to trim the adapters. I trimmed all the raw reads with TrimGalore! using the following command:
===============
~/TrimGalore-0.6.5/trim_galore --clip_R1 15 --clip_R2 15 --paired read_1_sample_1.fastq.gz read_2_sample_1.fastq.gz read_1_sample_2.fastq.gz read_2_sample_2.fastq.gz read_1_sample_N.fastq.gz read_2_sample_N.fastq.gz
===============
After the trimming, I ran FastQC + MultiQC again and found that, surprisingly, all samples failed the Per Base Sequence Content test. All samples shared the same pattern: the 3' end is strongly biased, with the content of C being very low (see the attached Per Base Sequence Content-After trimming.jpg).
My question is: should I worry about the bias at the 3' end, or should I trim the 3' end further? Specifically, the curve for C is roughly horizontal before the trimming. Why did this curve drop to almost zero after the trimming? An online discussion (https://github.com/FelixKrueger/Trim...-auto-detectio) mentioned that [Note that the sharp decrease of A at the last position is a result of removing the adapter sequence very stringently, i.e. even a single trailing A at the end is removed.] However, as far as I understand, trimming at the 3' end just means removing the adapter sequence (if there is sequencing read-through). The trimming should not affect the remaining sequence (i.e., the part that is kept). If the curve of C is horizontal before the trimming, it should also be horizontal after the trimming. I am a bit confused.

Concern-2: Per Sequence GC Content
Before trimming, I found that many samples failed the Per Sequence GC Content test because of the multiple peaks in the plot (see the attached Per Sequence GC Content--No trimming.jpg). I thought that this failure was due to adapter contamination. However, after trimming, many samples still have the issue (see the attached Per Sequence GC Content--After trimming.jpg).

My question is: why do my samples show multiple peaks? Is it possible that my samples contain more than one dominant species? Or are the multiple peaks due to sequencing/processing errors? How should I fix this issue?

Question-3: The sequencing I did is shallow sequencing. Also, my samples are not pure culture samples--they contain millions of different species of microbes. We will examine the microbial community structure and look for functional genes. In this case, should I do assembly before the downstream analysis? I have read some online discussions. Some suggest assembly, and some say that it is better to skip it. I am new to this area and do not know which (with vs. without assembly) is the better choice.

Thanks for reading this posting!
Old 03-09-2020, 10:38 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177

Rule #1: Do not get hung up on the big red X's in FastQC.

The thresholds which delineate Pass|Warn|Fail for the various metrics in FastQC were set using beautiful, single species, perfectly random and uniform genomic DNA libraries. Things that deviate from this in terms of sampling method, library content and library construction produce false failures. It is likely that the data is perfectly good for your organism(s), given that you are performing a metagenomic experiment with widely variable samples.

You stated that you made these libraries using a Nextera kit. The tagmentation in Nextera library kits is not perfectly random; there is a sequence composition bias at the tagmentation site. Your original (untrimmed) Per Base Sequence Content plot is perfectly normal for Nextera libraries; the bias at the 5' end simply reflects the bias of the tagmentation enzyme. There is no need to trim the 5' end, but go ahead if you want to.

I have seen the highly skewed 3' end in the Per Base Sequence Content plot before with trimmed reads. I'm not sure whether it is an artifact of trimming or of the grouping algorithm in FastQC when it doesn't have enough bases left to fill its default group size of 5 bp. (This is purely speculation.)
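One way to probe the trimming-artifact idea is a toy simulation. The sketch below is not Trim Galore's or Cutadapt's actual algorithm: it assumes exact matching only, a minimum overlap of 1 base, and that the Nextera adapter searched for begins with CTGTCTCTTATA. The function names, read length and read count are made up for illustration. Under those assumptions, any read whose last base is C loses that base, so no full-length read can end in C after trimming, which would pull the C curve toward zero at the final position even though the untrimmed reads are perfectly uniform.

```python
import random

# Assumption for this sketch: the adapter prefix searched for in Nextera libraries.
ADAPTER = "CTGTCTCTTATA"


def trim_3prime(read, adapter=ADAPTER, min_overlap=1):
    """Drop the longest read suffix that exactly matches an adapter prefix.

    Toy stand-in for stringent 3' trimming; real trimmers also allow mismatches.
    """
    for k in range(min(len(read), len(adapter)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def base_fraction_by_position(reads, base, read_len):
    """Fraction of `base` at each position, counting only reads that reach it."""
    fractions = []
    for pos in range(read_len):
        column = [r[pos] for r in reads if len(r) > pos]
        fractions.append(column.count(base) / len(column) if column else 0.0)
    return fractions


rng = random.Random(0)
READ_LEN = 50  # hypothetical read length, kept short so the sketch runs quickly
raw_reads = ["".join(rng.choices("ACGT", k=READ_LEN)) for _ in range(20000)]
trimmed_reads = [trim_3prime(r) for r in raw_reads]

c_before = base_fraction_by_position(raw_reads, "C", READ_LEN)
c_after = base_fraction_by_position(trimmed_reads, "C", READ_LEN)

print(f"C at last position before trimming: {c_before[-1]:.3f}")
print(f"C at last position after trimming:  {c_after[-1]:.3f}")
```

In this toy model the kept part of each read is untouched, yet the per-position plot still changes, because only reads that were not trimmed contribute to the last position. Whether this fully accounts for the real plots is not established here; FastQC's position grouping may contribute as well.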

Regarding the GC content plots, you are sampling a large diversity of bacteria from a variety of very distinct environments. It is totally expected that the bacterial populations in your different environments would have widely variable GC content distributions. This has nothing to do with adapters. Again, the failure is due to FastQC's expectations not matching the reality of the experiment you are performing.
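To illustrate why a mixed community can produce several peaks without anything being wrong, here is a minimal sketch. It assumes a made-up community of just two members, one at 35% GC and one at 65% GC, with 100 bp reads; none of these numbers come from the poster's data, and the helper names are hypothetical.

```python
import random
from collections import Counter


def gc_percent(read):
    """Integer GC percentage of one read."""
    gc = sum(1 for b in read if b in "GC")
    return 100 * gc // len(read)


def simulate_reads(rng, gc_prob, n_reads, read_len=100):
    """Draw reads whose bases are G/C with probability gc_prob (hypothetical genome)."""
    reads = []
    for _ in range(n_reads):
        bases = [rng.choice("GC") if rng.random() < gc_prob else rng.choice("AT")
                 for _ in range(read_len)]
        reads.append("".join(bases))
    return reads


rng = random.Random(1)
# Two made-up community members: one AT-rich (35% GC), one GC-rich (65% GC).
community = simulate_reads(rng, 0.35, 5000) + simulate_reads(rng, 0.65, 5000)

# Histogram of per-read GC in 5% bins, similar in spirit to FastQC's plot.
histogram = Counter(gc_percent(r) // 5 * 5 for r in community)
for bin_start in sorted(histogram):
    print(f"{bin_start:3d}-{bin_start + 4:3d}%  {'#' * (histogram[bin_start] // 50)}")
```

The text histogram shows two humps with a trough near 50% GC. A real soil or sewage sample has far more members, so the shape will be messier, but the mechanism is the same: each abundant taxon contributes reads around its own genomic GC content.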

The Adapter content plot is the only one which really shows something you need to address. It is normal (especially for libraries prepared using Nextera kits) to have some fragments shorter than your read length (150bp in your case). Your particular libraries vary from ~20% to 35% in the percentage of fragments < 150bp. Performing 3' adapter trimming is required to remove adapter sequences from these reads.
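The read-through arithmetic behind this can be sketched as follows. The fragment lengths are hypothetical and are not taken from the poster's libraries; only the 150 bp read length comes from the thread.

```python
def adapter_bases_in_read(fragment_len, read_len=150):
    """Number of 3' bases in a read that lie beyond the end of the fragment."""
    return max(0, read_len - fragment_len)


def fraction_with_read_through(fragment_lengths, read_len=150):
    """Share of fragments shorter than the read length."""
    short = sum(1 for n in fragment_lengths if n < read_len)
    return short / len(fragment_lengths)


# Hypothetical fragment lengths (bp); not taken from the poster's libraries.
example_fragments = [90, 120, 149, 150, 200, 300, 350, 400, 450, 500]

for n in example_fragments[:4]:
    print(f"fragment {n} bp -> {adapter_bases_in_read(n)} bp of read-through")
print(f"fraction needing 3' trimming: {fraction_with_read_through(example_fragments):.0%}")
```

Any fragment shorter than the read length leaves adapter-derived sequence at the 3' end of the read, which is why the trimming is done at that end only.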

Last edited by kmcarr; 03-10-2020 at 11:19 AM. Reason: Correct 5'/3' mixup
Old 03-10-2020, 11:17 AM   #3
yy273826987
Junior Member
 
Location: Cincinnati, Ohio, USA

Join Date: Mar 2020
Posts: 6

Dear kmcarr,

Thanks a lot for the reply and explaining the details. Appreciate that!

After reading your response, I understand that the adapter contamination is the only thing that I need to worry about. I have used TrimGalore! to remove the adapters from the 3'-end of the raw reads. However, you also suggested that "Performing 5' adapter trimming is required to remove adapter sequences from these reads." I am a bit confused. Based on my current understanding (maybe I am wrong), in my case, I only have adapters at the 3'-end of the reads. Do we have adapters at both ends (3'- and 5'-)?

Thanks again!
Old 03-10-2020, 11:20 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177

Sorry, that was an error. I meant to type "Performing 3' adapter trimming...."

I have edited my original post to fix this.
Old 03-10-2020, 11:32 AM   #5
yy273826987
Junior Member
 
Location: Cincinnati, Ohio, USA

Join Date: Mar 2020
Posts: 6

Dear kmcarr,

Thanks for the quick response and the clarification.

May I ask a few more questions? For my specific case, should I perform assembly before downstream analysis?

Also, after quality control, which software or pipeline would you suggest I begin with (for assembly, annotation, taxonomic analysis, and finding functional genes)? There are numerous tools and pipelines, and as a newcomer I am having a hard time deciding which one to start with.

Thanks!
Old 03-11-2020, 04:55 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177

yy2,

The downstream analysis part is a bit outside my area so I'll have to leave that to others to help you.

Cheers.
All times are GMT -8.