Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post
Tophat paired-end reads and minimum length | sugo | RNA Sequencing | 2 | 02-12-2015 04:52 PM
Minimum Read Length for BWA | persorrels | Bioinformatics | 6 | 10-05-2013 12:03 PM
vcfutils.pl - specifying minimum read depth? | Lspoor | Bioinformatics | 3 | 05-27-2013 02:20 AM
picard error: Mismatch between read length and quals length writing read | shawpa | Bioinformatics | 0 | 08-20-2012 06:52 AM
Periodical illumina read length distribution after trimming of low-quality bases | luxmare | General | 4 | 12-20-2010 04:18 PM
#1
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
Hi everyone,
I will be performing de novo and reference-guided assembly of mitochondrial genomes using Geneious, and have a question about the minimum read length I should trim my reads to. The libraries are Nextera XT preparations from purified organellar DNA (~450 kb genome), paired-end sequenced on a MiSeq (V3 chemistry) with 300 bp reads. I am using Trimmomatic to trim the reads, and I notice that after quality trimming (without specifying a minimum length) some of the reads are very short. I am wondering what a good minimum length for trimming would be. I tried trimming to a minimum length of 250 bp, but that removes a lot of reads and seems too stringent. I am thinking a minimum length of 100-150 bp may be better. Any suggestions?
#2
Senior Member
Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,088
Can you post FastQC plots from this run, if you have them available? It would be useful to see whether the problem with low-quality bases is due to a low-nucleotide-diversity library (which leads to characteristic drops in Q-scores in later read cycles). Was a PhiX spike-in used for this run?
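If the reports haven't been generated yet, a minimal FastQC run over one library would look something like this (the file names are just placeholders for your own paired files):
Code:
# one HTML report per input file is written into the existing fastqc_reports/ directory
fastqc library1_R1.fastq.gz library1_R2.fastq.gz -o fastqc_reports/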
#3
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
You can see that for library #1, trimming plus a minimum read length of 250 bp removes about half of the reads, whereas for library #10 only about 1/6th of the reads survive trimming with the same 250 bp minimum (although total reads and quality were a little lower for #10 than #1 to begin with). Out of the 14 libraries I sequenced, only #10 showed this drastic reduction in reads after trimming with a minimum length specified. As for the PhiX spike-in, I'm not actually sure, but I am asking our genomics center whether they added it. I will let you know when I find out.
#4
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
#5
Senior Member
Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,088
For a 300-cycle run this data looks pretty good. I think you are being way too aggressive with the trimming (if it is being done only on Q-score). What Q-score cut-off are you using? Even for library #10 the median score in the last cycles is still close to Q30 in the raw data.
You may want to trim only to remove adapters (if any are present) and then give the downstream analysis a try. If you expect the R1 and R2 reads to overlap in the middle, you should merge them first with BBMerge or FLASH.
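As a rough sketch of that merging step with FLASH (the file names, output prefix, and maximum-overlap value below are placeholders to adjust for 2x300 bp data):
Code:
# merge overlapping R1/R2 pairs; -M raises the allowed overlap for 300 bp reads, -o sets the output prefix
flash library1_R1.fastq library1_R2.fastq -M 300 -o library1 -t 4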
#6
Senior Member
Location: UK | Join Date: Mar 2009 | Posts: 667
Hi Marisa,
From the FastQC stats it looks like your 'raw data' has already been trimmed (probably on the MiSeq, to remove the Illumina adapters), with a minimum length setting of 32. In the supplementary data for the Trimmomatic paper, they used a minimum length setting of 36. You may want to set the minimum length somewhat longer if you are doing assembly, depending on the k-mer lengths you want to use for the assembly.
#7
Member
Location: New Zealand | Join Date: Mar 2014 | Posts: 20
Hi Marisa,
The minimum length you keep should depend on the assembler you are using. Since you intend to use Geneious, there is probably no harm in keeping all the short reads, because it will automatically ignore reads that are too short to be useful. However, if you have more data than you need (e.g. over 100-fold mitochondrial coverage), then discarding the shortest reads might be worthwhile to improve performance.
#8
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
Here are the Trimmomatic parameters I have been using:
Code:
ILLUMINACLIP:$TRIMMOMATIC/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:250
It seems like there aren't any adapters present even before trimming, so perhaps I will just try the unclipped vs. clipped data (without specifying a minimum read length) for mapping and assembly. I thought about using FLASH, but I don't necessarily expect my reads to overlap, because the Bioanalyzer profiles of the libraries show an average size of 1,527 bp. Also, I checked with the genomics facility that made my libraries, and yes, they included a 1% PhiX spike-in for the run. Thanks again for all the helpful suggestions!
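For anyone following along, parameters like these normally sit at the end of a full paired-end Trimmomatic call, roughly as sketched below (the jar path and file names are placeholders, not the exact command used here):
Code:
java -jar trimmomatic-0.36.jar PE -phred33 \
  library1_R1.fastq.gz library1_R2.fastq.gz \
  library1_R1.paired.fq.gz library1_R1.unpaired.fq.gz \
  library1_R2.paired.fq.gz library1_R2.unpaired.fq.gz \
  ILLUMINACLIP:$TRIMMOMATIC/adapters/NexteraPE-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:250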
#9
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
#10
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
#11
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
You can check the actual insert-size distribution empirically; running
Code:
bbmerge.sh in=reads.fq ihist=ihist.txt reads=100000
...will take about a second. The fraction joined and the position of the peak in the graph will make it clear what the real distribution looks like. If the graph is still rising and then abruptly drops to zero just before twice the read length, the insert sizes are generally too long for merging.
Oh, and I agree with GenoMax that your trimming is too aggressive. Aggressive trimming can cause bias, since some bases and motifs yield lower quality scores, and that can lead to a worse overall assembly. I don't typically trim above Q16, and I normally stay at Q10 or below.
Also, bear in mind that too much coverage can cause a poor assembly (depending on the assembler), so subsampling/normalization to 100x or lower will often improve your results.
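As a sketch of that subsampling/normalization idea using BBNorm from the same BBTools package (file names are placeholders):
Code:
# target=100 normalizes coverage to roughly 100x; min=5 discards reads below ~5x apparent depth
bbnorm.sh in=library1_R1.fq.gz in2=library1_R2.fq.gz out=norm_R1.fq.gz out2=norm_R2.fq.gz target=100 min=5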
#12
Senior Member
Location: Gainesville | Join Date: Apr 2012 | Posts: 140
See here: http://genomebio.org/nail-in-the-qua...imming-coffin/ and the paper here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3908319/
#13
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
I would not call that conclusive evidence that Q5 is a good trimming threshold. It depends heavily on what you're doing, the trimming algorithm, your read length, read quality profile, coverage depth, and a variety of other factors. Not to mention that the paper's granularity was too coarse.
#14
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
Thanks for the useful information. I quality trimmed the reads with a lower score limit (Q10) and did not specify a minimum read length. I merged the reads with your program, and have attached a few files for you to take a look at. It looks like most of the insert sizes of the libraries have a mean of 250-330bp. So I guess it looks like I should try to merge the reads before assembly. Do you have any suggestions on using bbmerge? |
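For reference, in Trimmomatic step terms the revised trimming roughly corresponds to dropping MINLEN and lowering the sliding-window quality to 10, e.g. (assuming the same adapter clipping as before):
Code:
ILLUMINACLIP:$TRIMMOMATIC/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:10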
#15
Senior Member
Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,088
Check out the BBMerge thread for nuggets of knowledge: http://seqanswers.com/forums/showthread.php?t=43906
#16
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
Oh, sorry, the command I gave you was for interleaved reads, so those are all false positive merges. For pairs in separate files, the command would be:
Code:
bbmerge.sh in1=R1.fq in2=R2.fq ihist=ihist.txt reads=100000
As for whether you should merge the reads before assembly, that depends on the assembler. AllPathsLG does its own merging; Ray seems to do well with merged reads; SoapDenovo creates worse assemblies with merged reads; there's no single rule. I've never used Geneious, so I don't know whether merging would help it. As with many options, such as trimming and subsampling, sometimes the only way to get the best assembly is to try it both ways. If you do merge the reads, I suggest using the default settings:
Code:
bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq
...then feed the assembler both the merged and the unmerged reads. Many (or most) assemblers will accept both paired and unpaired reads; merging should not be done for assemblers that don't let you supply paired and unpaired reads simultaneously, since low-complexity genomic regions will not merge as well.
#17
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34
I re-ran BBMerge with the correct command, and the distribution suggests the insert sizes are too big for merging. The fraction joined is around 60% for most libraries, though; I'm not sure whether that is good or bad.
#18
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
For a 2x300 bp library, getting 60% merging and a median of 400 bp is actually pretty close to optimal for merging. Again, whether merging is a good idea depends on the assembler, but this library looks good. You never see 90%+ merging unless the insert sizes came out way too short.
#19
Member
Location: St. Paul, MN | Join Date: Aug 2010 | Posts: 34