Old 11-18-2014, 11:58 AM   #1
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34
Minimum length for read trimming

Hi everyone,
I will be performing de novo and reference-guided assembly of mitochondrial genomes using Geneious, and have a question about the recommended minimum read length to trim my reads to.

The libraries I have are Nextera XT preps from purified organellar DNA (~450 kb genome), paired-end sequenced on MiSeq V3 with 300 bp reads. I am using Trimmomatic to trim my reads, and I notice that after quality trimming (without specifying a min length), some of the reads are very short.

I am wondering what a good minimum length for trimming would be. I tried trimming to a min length of 250 bp, but that removes a lot of reads and seems too stringent. I am thinking a min length of 100-150 bp may be better.

Any suggestions?
Old 11-18-2014, 12:51 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

Can you post FastQC plots of this run, if you have them available? It would be useful to see if the problem with low quality bases is due to low nucleotide diversity libraries (which leads to characteristic drops in Q-scores later in read cycles). Was a phiX spike-in used for this run?
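
If you haven't generated them yet, FastQC is quick to run from the command line. A minimal sketch, assuming FastQC is on your PATH (file names here are placeholders for your actual fastq files):
Code:
mkdir -p fastqc_results
fastqc 1_R1.fastq 1_R2.fastq --outdir fastqc_results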
Old 11-18-2014, 02:57 PM   #3
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by GenoMax
Can you post FastQC plots of this run, if you have them available? It would be useful to see if the problem with low quality bases is due to low nucleotide diversity libraries (which leads to characteristic drops in Q-scores later in read cycles). Was a phiX spike-in used for this run?
I have posted the FastQC results for 2 different libraries, #1 and #10. For each library, I have the FastQC results for the raw untrimmed data, quality trimmed data, and quality trimmed + 250 bp minimum read length data. I will upload the 6th file in a separate reply.

You can see that for library #1, trimming with a 250 bp min read length removes about half of the reads, whereas for library #10 only 1/6th of the reads remain (although the total reads and quality were a little lower for #10 than for #1 to begin with). Out of the 14 libraries I sequenced, only #10 showed such a drastic reduction in reads after trimming with a min length specified.

As for the phiX spike-in, I'm not actually sure, but I am asking our genomics center whether they added it. I will let you know when I find out.
Attached Files
File Type: pdf 1_raw.pdf (659.3 KB, 40 views)
File Type: pdf 1_trim_250bp_minlen.pdf (625.7 KB, 19 views)
File Type: pdf 1_trim_no_minlen.pdf (624.7 KB, 14 views)
File Type: pdf 10_raw.pdf (628.9 KB, 10 views)
File Type: pdf 10_trim_250bp_minlen.pdf (628.7 KB, 8 views)
Old 11-18-2014, 02:57 PM   #4
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by GenoMax
Can you post FastQC plots of this run, if you have them available? It would be useful to see if the problem with low quality bases is due to low nucleotide diversity libraries (which leads to characteristic drops in Q-scores later in read cycles). Was a phiX spike-in used for this run?
Below is the 6th file.
Attached Files
File Type: pdf 10_trim_no_minlen.pdf (591.8 KB, 12 views)
Old 11-18-2014, 03:49 PM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

For a 300-cycle run this data looks pretty good. I think you are being way too aggressive in trimming (if that is only being done on Q-score). What Q-score cut-off are you using? Even for library #10 the median score for the last cycles is still close to Q30 in the raw data.

You may want to just trim to remove adapters (if any are present) and then give the downstream analysis a try. If you expect the R1 and R2 reads to overlap in the middle then you should use BBMerge or FLASH to do that first.
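
If you want to try the adapter-only route first, something along these lines should work with Trimmomatic (a sketch; the jar version, adapter path and file names are placeholders to adjust for your setup):
Code:
java -jar trimmomatic-0.32.jar PE 1_R1.fastq 1_R2.fastq 1_R1.paired.fq 1_R1.unpaired.fq 1_R2.paired.fq 1_R2.unpaired.fq ILLUMINACLIP:NexteraPE-PE.fa:2:30:10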
Old 11-18-2014, 03:59 PM   #6
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

Hi Marisa,

From the FastQC stats it looks like your 'raw data' has already been trimmed (probably on the MiSeq, to remove the Illumina adapters), with a min length setting of 32.

In the supplementary data for the Trimmomatic paper, they used a min length setting of 36. You may want to set the min length slightly longer if you are doing assembly, depending on what kmer lengths you want to use for the assembly.
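
For example, if you were to assemble with k-mers up to 63, any read shorter than 63 bp contributes no k-mers at all, so a floor somewhat above your largest k is a sensible choice. A hypothetical Trimmomatic tail along those lines (the 100 is illustrative, not a tested recommendation):
Code:
SLIDINGWINDOW:4:20 MINLEN:100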
Old 11-18-2014, 04:24 PM   #7
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20

Hi Marisa,

The minimum length you keep should depend on the assembler you are using. Since you intend to use Geneious, there is probably no harm in keeping all the short reads, as it will automatically ignore reads that are too short to be useful to it. However, if you have more data than you need (e.g. over 100-fold mitochondrial coverage), then discarding the shortest reads might be worthwhile to improve performance.
Old 11-19-2014, 07:11 AM   #8
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by GenoMax
For a 300-cycle run this data looks pretty good. I think you are being way too aggressive in trimming (if that is only being done on Q-score). What Q-score cut-off are you using? Even for library #10 the median score for the last cycles is still close to Q30 in the raw data.

You may want to just trim to remove adapters (if any are present) and then give the downstream analysis a try. If you expect the R1 and R2 reads to overlap in the middle then you should use BBMerge or FLASH to do that first.
These are the Trimmomatic settings I was using:
Code:
ILLUMINACLIP:$TRIMMOMATIC/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:250

It seems like there aren't any adapters present even before trimming, so perhaps I will just try the unclipped vs. clipped data (without specifying a minimum read length) for mapping and assembly. I thought about using FLASH, but I don't necessarily expect my reads to overlap, because the Bioanalyzer profiles of the libraries show an average fragment size of 1,527 bp.

Also, I checked with the genomics facility that made my libraries, and yes, they included a 1% PhiX spike-in in the run.

Thanks again for all the helpful suggestions!
Old 11-19-2014, 07:12 AM   #9
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by mastal
Hi Marisa,

From the FastQC stats it looks like your 'raw data' has already been trimmed (probably on the MiSeq, to remove the Illumina adapters), with a min length setting of 32.

In the supplementary data for the Trimmomatic paper, they used a min length setting of 36. You may want to set the min length slightly longer if you are doing assembly, depending on what kmer lengths you want to use for the assembly.
Thanks for the suggestion. I will try both the "raw data" and the min-length-trimmed data for assembly to see what the results look like.
Old 11-19-2014, 07:14 AM   #10
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by Matt Kearse
Hi Marisa,

The minimum length you keep should depend on the assembler you are using. Since you intend to use Geneious, there is probably no harm in keeping all the short reads, as it will automatically ignore reads that are too short to be useful to it. However, if you have more data than you need (e.g. over 100-fold mitochondrial coverage), then discarding the shortest reads might be worthwhile to improve performance.
I do have a ridiculous amount of coverage, even after trimming with a min length specified (~400X for most libraries), so perhaps some less stringent trimming will improve performance.
Old 11-19-2014, 09:55 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by Marisa_Miller
I don't necessarily expect my reads to overlap, because the Bioanalyzer profiles of the libraries show an average fragment size of 1,527 bp.
I would not trust that number. Perhaps molecules in the library have that kind of size, but they would not be expected to amplify and cluster correctly on the sequencing machine; I am told that 800 bp is roughly the limit. It's best to make a draft assembly and map to it to determine the real insert size distribution. You can alternatively get a quick idea of the insert distribution by merging, which is really fast -

bbmerge.sh in=reads.fq ihist=ihist.txt reads=100000

...will take about a second. The fraction joined and the position of the peak in the graph will make it clear what the real distribution is like. If the graph is still rising then abruptly drops to zero just before 2x(read length) then the insert sizes are generally too long for merging.

Oh, and I agree with GenoMax that your trimming is too aggressive. Aggressive trimming can cause bias, since some bases and motifs yield lower quality scores, which can cause a worse overall assembly. I don't typically trim to over Q16, and normally stay at Q10 or below.

Also, bear in mind that too much coverage can cause poor assembly (depending on the assembler), so subsampling / normalization to 100x or lower will often improve your results.
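
Subsampling is a one-liner with BBTools if you want to try it; a sketch with placeholder file names (with your ~400x, a samplerate around 0.25 would land near 100x):
Code:
reformat.sh in1=r1.fq in2=r2.fq out1=sampled_r1.fq out2=sampled_r2.fq samplerate=0.25
Normalizing to a depth target instead, e.g. with bbnorm.sh and target=100, is the other option.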
Old 11-19-2014, 11:07 AM   #12
crazyhottommy
Senior Member
 
Location: Gainesville

Join Date: Apr 2012
Posts: 140

Quote:
Originally Posted by Marisa_Miller
Hi everyone,
I will be performing de novo and reference-guided assembly of mitochondrial genomes using Geneious, and have a question about the recommended minimum read length to trim my reads to.

The libraries I have are Nextera XT preps from purified organellar DNA (~450 kb genome), paired-end sequenced on MiSeq V3 with 300 bp reads. I am using Trimmomatic to trim my reads, and I notice that after quality trimming (without specifying a min length), some of the reads are very short.

I am wondering what a good minimum length for trimming would be. I tried trimming to a min length of 250 bp, but that removes a lot of reads and seems too stringent. I am thinking a min length of 100-150 bp may be better.

Any suggestions?
For trimming on Phred scores, trim at Q5, not Q20.
See here: http://genomebio.org/nail-in-the-qua...imming-coffin/
and the paper here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3908319/
Old 11-19-2014, 11:13 AM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I would not call that conclusive evidence that 5 is a good value for trimming. It depends heavily on what you're doing, the trimming algorithm you use, your read length, read quality profile, coverage depth, and a variety of other factors. Not to mention that the paper's granularity was too coarse.
Old 11-20-2014, 09:00 AM   #14
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by Brian Bushnell
I would not trust that number. Perhaps molecules in the library have that kind of size, but they would not be expected to amplify and cluster correctly on the sequencing machine; I am told that 800 bp is roughly the limit. It's best to make a draft assembly and map to it to determine the real insert size distribution. You can alternatively get a quick idea of the insert distribution by merging, which is really fast -

bbmerge.sh in=reads.fq ihist=ihist.txt reads=100000

...will take about a second. The fraction joined and the position of the peak in the graph will make it clear what the real distribution is like. If the graph is still rising then abruptly drops to zero just before 2x(read length) then the insert sizes are generally too long for merging.

Oh, and I agree with GenoMax that your trimming is too aggressive. Aggressive trimming can cause bias, since some bases and motifs yield lower quality scores, which can cause a worse overall assembly. I don't typically trim to over Q16, and normally stay at Q10 or below.

Also, bear in mind that too much coverage can cause poor assembly (depending on the assembler), so subsampling / normalization to 100x or lower will often improve your results.
Hi Brian,
Thanks for the useful information. I quality trimmed the reads with a lower score limit (Q10) and did not specify a minimum read length.
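
In other words, my earlier settings with the sliding-window threshold lowered to 10 and MINLEN dropped; roughly:
Code:
ILLUMINACLIP:$TRIMMOMATIC/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:10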

I merged the reads with your program, and have attached a few files for you to take a look at. It looks like most of the libraries have a mean insert size of 250-330 bp. So I guess it looks like I should try to merge the reads before assembly. Do you have any suggestions on using bbmerge?
Attached Files
File Type: txt 1_R1.fastq.PE_ihist.txt (370 Bytes, 5 views)
File Type: txt 1_R2.fastq.PE_ihist.txt (284 Bytes, 1 views)
File Type: txt 10_R2.fastq.PE_ihist.txt (230 Bytes, 1 views)
File Type: txt 10_R1.fastq.PE_ihist.txt (354 Bytes, 1 views)
Old 11-20-2014, 10:04 AM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

Quote:
Originally Posted by Marisa_Miller
Do you have any suggestions on using bbmerge?
Check out the BBMerge thread for nuggets of knowledge: http://seqanswers.com/forums/showthread.php?t=43906
Old 11-20-2014, 10:09 AM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Oh, sorry, the command I gave you was for interleaved reads, so those are all false positive merges. For pairs in separate files, the command would be:

bbmerge.sh in1=R1.fq in2=R2.fq ihist=ihist.txt reads=100000

As for whether you merge the reads before assembly, that depends on the assembler. AllPathsLG does its own merging; Ray seems to do well with merged reads; SoapDenovo creates worse assemblies with merged reads; there's no single rule. I've never used Geneious, so I don't know if it would help. As with many options, such as trimming and subsampling, sometimes the only way to get the best assembly is to try both ways.

If you do merge the reads, I suggest using the default settings:

bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq

...then feed the assembler both the merged and unmerged reads. Many or most assemblers will accept both paired and unpaired reads; merging should not be done for assemblers that don't allow you to feed them both paired and unpaired reads simultaneously, as low-complexity genomic regions will not merge as well.
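
As a concrete illustration (file names carried over from the command above; SPAdes is just an example here, and whether it suits your data is something to test), an assembler that accepts both at once can be run like:
Code:
spades.py --merged merged.fq -1 unmerged1.fq -2 unmerged2.fq -o assembly_out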
Old 11-20-2014, 12:09 PM   #17
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by Brian Bushnell
The fraction joined and the position of the peak in the graph will make it clear what the real distribution is like. If the graph is still rising then abruptly drops to zero just before 2x(read length) then the insert sizes are generally too long for merging.
Hi Brian,
I re-ran bbmerge with the correct command, and it looks like the distribution shows that the insert sizes are too big for merging. That said, the fraction joined is around 60% for most libraries; I'm not sure if that is good or bad.
Attached Files
File Type: txt ihist_1.txt (4.1 KB, 6 views)
Old 11-20-2014, 12:24 PM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707



For a 2x300 bp library, getting 60% merging and a median of 400 bp is pretty much optimal for merging, actually. Again, whether merging is a good idea depends on the assembler, but this library is a good candidate. You never see 90%+ merging unless the insert sizes came out way too short.
Attached Images
File Type: png ihist.png (11.3 KB, 91 views)
Old 11-20-2014, 12:27 PM   #19
Marisa_Miller
Member
 
Location: St. Paul, MN

Join Date: Aug 2010
Posts: 34

Quote:
Originally Posted by Brian Bushnell


For a 2x300 bp library, getting 60% merging and a median of 400 bp is pretty much optimal for merging, actually. Again, whether merging is a good idea depends on the assembler, but this library is a good candidate. You never see 90%+ merging unless the insert sizes came out way too short.
I think I misunderstood earlier what to look for in a library to see if it can be merged. I will go ahead and merge them and give the assembly a shot with both merged and unmerged reads. Thanks again!