SEQanswers

Old 04-25-2017, 07:14 AM   #1
lwebs
Junior Member
 
Location: MA

Join Date: Mar 2017
Posts: 7
Quality-filtering shotgun metagenomic sequences from environmental samples advice

Hello all!

I am analyzing paired-end shotgun metagenomic sequences generated on an Illumina HiSeq 4000 from environmental samples. I am new to shotgun metagenomic data, but have experience analyzing 16S data.

The reads are 150 nt in length, and the majority of the fragment sizes range from 280 to 700 bp. A few samples have fragment sizes ranging from 80 to 600 bp.

I am using the iu-filter-quality-minoche program from illumina-utils to quality-filter reads before de novo assembly (see here for more info: https://github.com/merenlab/illumina-utils).

So far, approximately 68% of both R1 and R2 pass the QC parameters while 32% fail (94% of failures are due to R2).

Here are my questions:
Is this error rate and magnitude for read 2 normal?
Should I quality filter the reads prior to merging some of the reads (if only about 20% can be merged)?
Can I use both merged reads and unmerged R1 and R2 for de novo assembly using Megahit?

Thanks for the help!
Any guidance would be appreciated!
Old 04-25-2017, 12:33 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by lwebs
I am using the iu-filter-quality-minoche program from illumina-utils to quality-filter reads before de novo assembly (see here for more info: https://github.com/merenlab/illumina-utils).

So far, approximately 68% of both R1 and R2 pass the QC parameters while 32% fail (94% of failures are due to R2).

Here are my questions: Is this error rate and magnitude for read 2 normal?

That's extremely high. Either you have a failed sequencing run, or your threshold is much too strict. It would be useful to post a quality-score boxplot, though. Anyway, quality-trimming is generally better than filtering, as it allows you both to retain more useful data and to remove more bad data.

Consulting your link:

Quote:
C33: less than 2/3 of bases were Q30 or higher in the first half of the read following the B-tail trimming

That sounds like too aggressive a threshold for an optimal metagenome assembly; it will result in low genome recovery and, likely, higher fragmentation (though I encourage you to verify this yourself). I'd suggest something more like Q10 trimming of the right end (which you can do with the BBDuk flags qtrim=r trimq=10), but the exact value depends on the dataset. Also, since adapter-trimming is universally positive while quality-trimming is only conditionally positive, I encourage you to adapter-trim the data prior to doing anything else.
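The order suggested above (adapter-trim first, then a gentle right-end quality trim) can be sketched as two BBDuk passes. This is only an illustration under assumptions: the sample file names and the `ADAPTERS` path are hypothetical, the adapter-trimming flags are copied from the BBDuk command posted later in this thread, and the `run` helper is a made-up dry-run wrapper that prints commands unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command; set RUN=1 to execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Assumed location of the adapter file that ships in BBTools' resources directory.
ADAPTERS="${ADAPTERS:-bbmap/resources/adapters.fa}"

# Step 1: adapter-trim the right end of each read (flags as posted in this thread).
trim_adapters() {   # args: in_R1 in_R2 out_R1 out_R2
    run bbduk.sh in1="$1" in2="$2" out1="$3" out2="$4" \
        ref="$ADAPTERS" ktrim=r k=23 mink=11 hdist=1 tpe tbo
}

# Step 2: quality-trim the right end at Q10 (the value depends on the dataset).
trim_quality() {    # args: in_R1 in_R2 out_R1 out_R2
    run bbduk.sh in1="$1" in2="$2" out1="$3" out2="$4" qtrim=r trimq=10
}

# Example with hypothetical file names.
trim_adapters sample_R1.fq sample_R2.fq sample_adapt_R1.fq sample_adapt_R2.fq
trim_quality sample_adapt_R1.fq sample_adapt_R2.fq sample_trim_R1.fq sample_trim_R2.fq
```

If the reads are going to be merged, the advice in this thread is to skip step 2 before merging and let the merger handle trimming of the pairs that fail.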

Quote:
Should I quality filter the reads prior to merging some of the reads (if only about 20% can be merged)?

I recommend trimming rather than filtering, but I don't recommend either prior to merging. BBMerge, incidentally, can do iterative quality-trimming only for reads that fail to merge without trimming, which improves the merge rate. Blanket quality-trimming all reads prior to merging can increase false-positive merges and reduce the merge rate due to fewer overlapping pairs.

Also, BBMerge can merge non-overlapping reads, if you have high enough coverage; this is useful in this kind of scenario where only 20% of the reads overlap due to a large average insert size.
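The two BBMerge modes described above might look roughly like the following. Treat this as a sketch from memory of the BBMerge documentation rather than a verified recipe: the flags `qtrim2=t`, `trimq=10,15,20`, `rem`, `k=62`, `extend2=50` and `ecct` should be checked against the help output of `bbmerge.sh` in your BBTools version, the file names are hypothetical, and `run` is a made-up dry-run wrapper that prints commands unless `RUN=1` is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command; set RUN=1 to execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Merge untrimmed pairs first; only pairs that fail are quality-trimmed
# (at Q10, then Q15, then Q20) and retried.
merge_with_retry_trim() {   # args: in_R1 in_R2 merged unmerged_R1 unmerged_R2
    run bbmerge.sh in1="$1" in2="$2" out="$3" outu1="$4" outu2="$5" \
        qtrim2=t trimq=10,15,20
}

# Attempt to merge non-overlapping pairs by extending reads with k-mer
# information from the dataset; needs sufficient coverage and memory.
merge_nonoverlapping() {    # args: in_R1 in_R2 merged unmerged_R1 unmerged_R2
    run bbmerge-auto.sh in1="$1" in2="$2" out="$3" outu1="$4" outu2="$5" \
        rem k=62 extend2=50 ecct
}

# Example with hypothetical file names.
merge_with_retry_trim sample_R1.fq sample_R2.fq merged.fq unmerged_R1.fq unmerged_R2.fq
merge_nonoverlapping sample_R1.fq sample_R2.fq merged.fq unmerged_R1.fq unmerged_R2.fq
```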

Quote:
Can I use both merged reads and unmerged R1 and R2 for de novo assembly using Megahit?

You should always use both merged and unmerged reads for assembly. But in my testing, while merging improves metagenomic assemblies from SPAdes and Ray, it does not improve them for Megahit, so I don't recommend it as a preprocessing step for Megahit.
Old 05-03-2017, 10:40 AM   #3
lwebs
Junior Member
 
Location: MA

Join Date: Mar 2017
Posts: 7

Thank you for the advice, Brian. I am trying out BBTools (BBDuk and BBMerge).

I just got BBDuk to run, but now I can't find the output files on my system. Do the output directories have to exist already to receive these files?

Below is the command I just ran:
bbduk.sh in1=1_ATGAGGCCAC_L007_R1_001.fastq in2=1_ATGAGGCCAC_L007_R2_001.fastq out1=1_cleanR1.fq out2=1_cleanR2.fq ref=/data/laura/Extracted_Metagenomes/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
Old 05-03-2017, 10:56 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,992

The result files should have gone to the directory you ran the command from, unless there was an error (e.g. you don't have write permission to the directory the original data is in).

Last edited by GenoMax; 05-03-2017 at 10:59 AM.
Old 05-03-2017, 10:57 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

The output files should be in your working directory, which in this case is the same directory as the input files. What do you get when you run "ls *.f*"?
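A small helper along the lines of the `ls *.f*` check could look like this. It is a sketch only: `list_outputs` is a hypothetical name, and it simply reports whether a directory is writable and lists the files in it whose names match `*.f*`. Relative `out1=`/`out2=` paths are resolved against the directory the command was run from.

```shell
#!/bin/sh
# Report whether a directory is writable and list files matching *.f* in it
# (covers .fq, .fastq, .fa, ...). Defaults to the current working directory.
list_outputs() {    # args: [directory]
    dir="${1:-.}"
    if [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
    fi
    for f in "$dir"/*.f*; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

pwd              # relative output paths end up under this directory
list_outputs .
```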
Old 05-03-2017, 10:59 AM   #6
lwebs
Junior Member
 
Location: MA

Join Date: Mar 2017
Posts: 7

Thanks! Found them!
Old 05-03-2017, 01:04 PM   #7
lwebs
Junior Member
 
Location: MA

Join Date: Mar 2017
Posts: 7

I am also looking for programs or scripts that would allow me to combine both the merged and orphaned PE reads into one file to use for assembly via Megahit. Any suggestions?

I tried to cat the files together, and Megahit rejected the file with the error 'number of paired-end files not match!'.
Old 05-03-2017, 01:20 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Don't cat paired and unpaired reads. For Megahit, you need to use the -r flag, like this:

Code:
megahit --12 paired.fq -r singletons.fq
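If the paired reads are in two separate files rather than one interleaved file, the same idea can be expressed with `-1`/`-2` in place of `--12`. The sketch below assumes hypothetical file and directory names, and `run` is a made-up dry-run wrapper that prints commands unless `RUN=1` is set; `-r` takes single-end reads (merged reads or orphans), and `-o` names the output directory.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command; set RUN=1 to execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Paired reads in two files, plus single-end reads passed separately with -r.
assemble_two_files() {      # args: R1 R2 singles outdir
    run megahit -1 "$1" -2 "$2" -r "$3" -o "$4"
}

# Paired reads interleaved in one file, plus single-end reads.
assemble_interleaved() {    # args: interleaved singles outdir
    run megahit --12 "$1" -r "$2" -o "$3"
}

# Example with hypothetical file names; the output directory should not
# already exist when the real command is run.
assemble_two_files unmerged_R1.fq unmerged_R2.fq singletons.fq assembly_out
assemble_interleaved paired.fq singletons.fq assembly_out_interleaved
```

Several files can be given to one option as a comma-separated list, for example `-r merged.fq,orphans.fq`, which avoids concatenating paired and unpaired reads.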
Old 05-03-2017, 01:30 PM   #9
lwebs
Junior Member
 
Location: MA

Join Date: Mar 2017
Posts: 7

Thank you! You have been a tremendous help!

All times are GMT -8.