SEQanswers

Old 02-05-2015, 07:13 AM   #21
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Hmmm... I don't think I have BBDuk working correctly.

Original Fastq file:

@M00561:19:000000000-ABAUW:1:1101:10946:1435 1:N:0:9
TCTCCCTTTTATCTTTACATACTGTCGTTCATTATCCTCTTATCTTATCAAACCTTGCTTTTCATCTTTCTTTTTTTTTTTTTTCTTTTCTCTTTCTTTTCTCCTTTCTACCCTCTATTTTGTTTTCTTTTTTTCTTCAGTTTTTCATCTTTTATTATTCTCTTTTATCTCTGTTTGGATAGCGCCTGACTGAAGTTATAACTTCGGTCATGTTATCTGT
+
CBCCCFFFFE9FAFFG9@,C,C@,CC8FF<8CC,CC@F,CC,C,;C,C@,,,;6CE,6,<CC@,,;<CC@,<C,,+78@++7++6<C,<6<,,,<<C?,,5,,,,,<,,,9,,:9,,,9,,,::,:@???<=++8,88,,,,:<8,,,5,,,:,,:,,,5,,,::A,,575,,:8,,+,,,,3+38++,8>,,,,,,,,,,7>@=**66*,,77,@@>,,
@M00561:19:000000000-ABAUW:1:1101:13213:1452 1:N:0:9
CCGTATTACCTGCCGCATCATTGTGAGTTGAAGATACATGTGCGGTTGATTTTATCTGGCTAGGCTACGTATTTCTATTTTTTTTCTCCCTTTCTTTTCCTTTTTTCTTTTTTTTTTCTTTTTTCTTTTCTTCTTCTGTTTTTTTTTATTTTTTCTTTTCTTACTCTTTTACTCCAGTGCCTTCAGATGTCTTTTTCTTCGCATTTTCCATTCTTTTTTATTTTTCCTATTTCCTTTATTTCCTCTACCCATCGTTAATATCACTTTCGTTTACTTTCACGTTAGGTTATACCG
+
8@<A9EF-C@C<AC+@+B@,CF9C9,,CF,,,,,C,C,C,<,6,+BF+,ECEF,C@E,,;C,,,6,,;,,,,,,69,<,9,,8++:96,::,,6<CC,969,,9,+4,,,<9+++4+4<,,9:@4,<8<8<,,,,5,,<,8,:6++8,::B,,+5,:,:7>,,7,7@,,7,7@>3,,7,77,77,,,7,,7@@<DB>,,6***66@,42,66,6,6,,3*5++5+35++++++3++5+5+5+3+54+2+*+30**2**+**0*2::C***09*2*2*1***0*)))19*2)/*)
@M00561:19:000000000-ABAUW:1:1101:15428:1464 1:N:0:9
CACCACCTCTTTTCATGGTACCATTTGCACGCTCCAAACTTGCATAGTGACCTTTTTCGATTAATTGACCAAAGTCAACATTATAACCGTCCTTTTTTTCATCCCTCTTTTCCTCCTCTCTCTTCTCCTCCTCTCCCATCCCCCTATTCAACAGGCCATCTCGTTTTCCTTCTTCTTCTTTTTAACCAATATTTTCTTTTCTTTTCCTTCTTCTTTTTTCCTTTTCCTTTTTTTTTTTTCTCTCTTTTCTCTTCTTTTTTTCCCTCTTTTCTCCTTTTTTTCTTTCTTTTTTCTTCTTT
+
8-BB9FFGFFGGGG9F9-C<EE9FFFACAF,@@@@<89DFE9F9F,<C69CCFFEFFE+,<C,,<,,,<;,,,,<,,,9,,,,,,,<,+88@@C,,:++9,,<,:,9:,<<66994=,<:5:9C,<,5,4,94994,,,9:4+8++,,:,,,,,,4+,448:,:+6+:7,:8???@,7A<,,,,,,,,+,,,,,,,,,,,,33,,,,,,,,,,,,,,,+,,,,,,,,,,,,,*******,1++++++++++++++++++))+**+**0*0***0**/**)))********)))****-*
@M00561:19:000000000-ABAUW:1:1101:12573:1467 1:N:0:9
CAGCTTATCACCCCGGAATTGGTTTATCCGGAGATGGGGTCTTATGGCTGGAAGAGGCCAGCACCTTTTCTCCCTCCTTTTCTCTTCTGCCGGCCCTTTATATTCCACTCGTATTTTTTGTTTTCTTTCCCTTTCTTACTTTTAACCTCTTCTTGTCTCCTATGTGACCAGCCTCTATTTTTTATTATAATTTTGATAACGTTTGTCTGCTCTTTATCTCCTTCACTTCTTGTTACCTATTTTCTCTCTTCTTCGTGTTTTTAGTGCCTTGGTCTGCCGCAGCGGGCGTGCTTGTTGAC

Cleaned FASTQ:


@M00561:19:000000000-ABAUW:1:1101:10946:1435 1:N:0:9
TCTCCCTTTTATCTT
+
CBCCCFFFFE9FAFF
@M00561:19:000000000-ABAUW:1:1101:13609:1492 1:N:0:9
TTGTAAAGCATCG
+
BC<CCCD<F9FF>
@M00561:19:000000000-ABAUW:1:1101:17917:1554 1:N:0:9
TGCTGGACCTGTG
+
6-AAB9EFGG8F,
@M00561:19:000000000-ABAUW:1:1101:10142:1572 1:N:0:9
TTACTGGCGTCCTTGCTTTCTCCTTC


It appears to have truncated all my reads by a massive amount.
M4TTN is offline   Reply With Quote
Old 02-05-2015, 07:18 AM   #22
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,089
Default

Yikes!

Can you post your command line for BBDuk? You are trimming the original files, correct (not merged one)?
GenoMax is offline   Reply With Quote
Old 02-05-2015, 07:20 AM   #23
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Sorry, my fault. I wrote:

k=28 k=12

instead of

k=28 mink=12

New Output:


Memory: free=1041m, used=19m

Added 126482 kmers; time: 0.091 seconds.
Memory: free=1032m, used=28m

Input is being processed as paired
Started output streams: 0.008 seconds.
Processing time: 69.585 seconds.

Input: 2546118 reads 645049609 bases.
KTrimmed: 11853 reads (0.47%) 597009 bases (0.09%)
Result: 2546034 reads (100.00%) 644452600 bases (99.91%)

Time: 69.691 seconds.
Reads Processed: 2546k 36.53k reads/sec
Bases Processed: 645m 9.26m bases/sec
M4TTN is offline   Reply With Quote
Old 02-05-2015, 07:23 AM   #24
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,089
Default

Much better. So a few had some adapters left over.
GenoMax is offline   Reply With Quote
Old 02-05-2015, 07:37 AM   #25
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

If I understand it correctly, adapter fragments shorter than 12 bp will still be left even after this cleaning process. Should I specify a smaller value for mink than 12 to deal with this?

Or perhaps there is a different tool for trimming based upon the degree of overlap between read pairs?

i.e. IF overlap is less than read length THEN truncate reads to overlap length.
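That overlap-based logic can be expressed with BBDuk's tbo (trimbyoverlap) flag, combined with the kmer trimming already in use. A sketch only - file names are placeholders, and this assumes BBMap is installed with the bundled Nextera adapter reference:

```shell
# Sketch: kmer-based adapter trimming plus overlap-based trimming (tbo).
# File names and paths are placeholders.
./bbduk.sh in1=R1.fastq in2=R2.fastq out1=R1.trim.fastq out2=R2.trim.fastq \
    ref=nextera.fa.gz ktrim=r k=23 mink=11 hdist=1 tbo
```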
M4TTN is offline   Reply With Quote
Old 02-05-2015, 08:11 AM   #26
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Actually it looks like bbmerge can do this with the tbo flag:

http://seqanswers.com/forums/showpos...0&postcount=11

I wonder if there is any benefit to processing our reads through this pipeline prior to aligning to the reference?
M4TTN is offline   Reply With Quote
Old 02-05-2015, 08:24 AM   #27
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,089
Default

Having reads free of extraneous sequence is going to benefit all downstream analysis, so you should do that trimming. Brian mentioned the tbo flag in his earlier post.
GenoMax is offline   Reply With Quote
Old 02-05-2015, 08:26 AM   #28
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

In case anyone is interested: the 2x300bp FASTQ files cleaned up with BBDuk using parameters k=28 mink=12 resulted in two cleaned read files, which, when re-aligned with Bowtie2 and then opened in SeqMonk, give a read distribution that looks like this:

https://www.dropbox.com/s/g5zlunp900...0file.png?dl=0

So clearly the residual adapter sequences are causing some issues with Bowtie2 correctly calling insert length, something that is partially corrected by BBDuk (but probably not fully, due to the mink=12 parameter).

If we have time tomorrow, we'll redo with the tbo flag.
M4TTN is offline   Reply With Quote
Old 02-05-2015, 08:28 AM   #29
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Quote:
Originally Posted by GenoMax View Post
Having reads free of extraneous sequence is going to benefit all downstream analysis, so you should do that trimming. Brian mentioned the tbo flag in his earlier post.
Oh dear. My apologies for not reading/understanding it all. Yes, thanks to Brian and GenoMax. It's been a tiring day, but thanks for all the help given to a complete novice!
M4TTN is offline   Reply With Quote
Old 02-05-2015, 08:45 AM   #30
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,089
Default

A tiring day perhaps, but I am glad it ended well.

Good Luck with the rest of your analysis.
GenoMax is offline   Reply With Quote
Old 02-06-2015, 08:39 AM   #31
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Thanks GenoMax and Brian. The community support here is excellent.

I've not had a chance to rerun the analyses today, but when we do, I'll report back.
M4TTN is offline   Reply With Quote
Old 03-26-2015, 09:37 AM   #32
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Hello again...

I've had some time to get back to playing with bbmerge/duk etc with some new datasets created on the MiSeq, and I am hoping for some input/explanation.

We are trying to push the insert size a little to maximise the value of 600bp sequencing. Previously we were getting average inserts of below 300bp.

In the last run, we used fewer Ampure beads, and also did a titration on a set of samples with increasing amounts of DNA input into the NexteraXT reaction. We then sequenced with 1x350 + 1x250bp.

We used Bowtie2 to align, then opened the SAM files in SeqMonk in order to plot the read/insert length histogram.

For unknown reasons, SeqMonk displays a really odd distribution of the new file - with very few reads below 350 bp in length. (See first image, left).

We are pretty sure this is just an error in reading the SAM alignment file, because we have seen similar errors from other analysis tools: whenever the reads fully overlap, you get a negative TLEN score, resulting in reads that are not plotted on the histogram. However, SeqMonk was previously one of the programs that did not have this issue (at least with 2x300bp sequencing).

Now, to try to solve the problem, and to improve the data fed to Bowtie2, I used BBDuk to trim the fastq files using the following command:

./bbduk.sh -Xmx1g in1=/Desktop/TRS10B248_S27_L001_R1_001.fastq in2=/Desktop/TRS10B248_S27_L001_R2_001.fastq out1=/Desktop/TRS10B248_R1.fastq out2=/Desktop/TRS10B248_R2.fastq tbo ktrim=r ref=/Downloads/bbmap/resources/nextera.fa.gz k=25 mink=1 hdist=1

Then did Bowtie2 alignment in Ugene and opened the SAM file in SeqMonk.

The resulting histogram is much better, but still has lots of weird peaks in it. (See first image, right)

Finally, we found that using input DNA of 1ng vs input DNA of 12ng had almost no discernible effect on the insert size distribution (see second image).

Can anyone help to explain what causes programs to:

1. Misinterpret the SAM file insert lengths?
2. Create the weird peaks seen in the second histogram?

Links to images here:

https://www.dropbox.com/s/hrig4gaxoj...right.png?dl=0

https://www.dropbox.com/s/c1obrn8n3t...ottom.png?dl=0

Note: I have also used bbmerge to get insert lengths, but it only detects about 30% of reads as overlapping in the new samples (versus about 70% in the previous samples, which were purified using more Ampure beads), so bbmerge is unable to calculate the lengths of the others. However, of the reads it does merge, the distribution is fairly curved/continuous, without any weird peaks (see last image).

https://www.dropbox.com/s/9ndhqt4upg...ogram.pdf?dl=0

Last edited by M4TTN; 03-26-2015 at 09:41 AM.
M4TTN is offline   Reply With Quote
Old 03-26-2015, 03:13 PM   #33
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I just posted a new version of BBMerge (in BBMap 34.72) that has been updated to have a substantially higher merge rate for very long overlaps with a high error rate. For extremely long reads (like 350bp), or in fact whenever you are using it to generate an insert size histogram, I recommend you use the new version and run with the "vloose" flag.

Insert sizes from overlapping will, of course, drop to zero before you get to the sum of the read lengths. But insert sizes by mapping are also inaccurate for another reason, and may drop to zero once the insert size drops below the length of your longer read, because the presence of adapter sequence will prevent reads from mapping (unless you allow clipped local alignments). There are also some programs that calculate insert size incorrectly for reads with insert size shorter than read length. Because you have a huge spike at around 300bp, I believe what you're seeing is probably both of these combined - bowtie2's mapping rate is rapidly dropping once insert size goes below 350bp due to increasing adapter sequence, and SeqMonk is incorrectly calculating the insert size of reads that are mapped with insert sizes shorter than read length; those two things will combine to give a false peak at right around read length. And, that's why the insert size histogram looks better after trimming - but there's still a peak present, presumably because BBDuk missed some of the adapters. If you did not already do this, I encourage you to do the adapter trimming with these flags, to be more thorough:

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=2 hdist2=1 tbo

That should get virtually all of the adapter sequence.

I don't have any ideas about the spikes between 500bp and 600bp, or why they show up in the mapping graph but not the overlap graph, unless they are due to a misassembly in the reference.

Last edited by Brian Bushnell; 03-26-2015 at 03:55 PM.
Brian Bushnell is offline   Reply With Quote
Old 03-26-2015, 03:39 PM   #34
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Thanks Brian.

Will those options work better than what I already used?:

tbo ktrim=r k=25 mink=1 hdist=1

Are you able to explain what hdist=2 and hdist2=1 do?

I've downloaded the new package, but I cannot see a description of these options listed when I type ./bbduk.sh in terminal:

Code:
Written by Brian Bushnell
Last modified March 22, 2015

Description:  Compares reads to the kmers in a reference dataset, optionally 
allowing an edit distance. Splits the reads into two outputs - those that 
match the reference, and those that don't. Can also trim (remove) the matching 
parts of the reads rather than binning the reads.

Usage:  bbduk.sh in=<input file> out=<output file> ref=<contaminant files>

Input may be stdin or a fasta, fastq, or sam file, compressed or uncompressed.
Output may be stdout or a file.
If you pipe via stdin/stdout, please include the file type; e.g. for gzipped 
fasta input, set in=stdin.fa.gz

Optional parameters (and their defaults)

Input parameters:

in=<file>           Main input. in=stdin.fq will pipe from stdin.
in2=<file>          Input for 2nd read of pairs in a different file.
ref=<file,file>     Comma-delimited list of reference files.
literal=<seq,seq>   Comma-delimited list of literal reference sequences.
touppercase=f       (tuc) Change all bases upper-case.
interleaved=auto    (int) t/f overrides interleaved autodetection.
qin=auto            Input quality offset: 33 (Sanger), 64, or auto.
reads=-1            If positive, quit after processing X reads or pairs.
copyundefined=f     (cu) Process non-AGCT IUPAC reference bases by making all
                    possible unambiguous copies.  Intended for short motifs
                    or adapter barcodes, as time/memory use is exponential.

Output parameters:

out=<file>          (outnonmatch) Write reads here that do not contain 
                    kmers matching the database.  'out=stdout.fq' will pipe 
                    to standard out.
out2=<file>         (outnonmatch2) Use this to write 2nd read of pairs to a 
                    different file.
outm=<file>         (outmatch) Write reads here that contain kmers matching
                    the database.
outm2=<file>        (outmatch2) Use this to write 2nd read of pairs to a 
                    different file.
outs=<file>         (outsingle) Use this to write singleton reads whose mate 
                    was trimmed shorter than minlen.
stats=<file>        Write statistics about which contaminants were detected.
refstats=<file>     Write statistics on a per-reference-file basis.
rpkm=<file>         Write RPKM for each reference sequence (for RNA-seq).
dump=<file>         Dump kmer tables to a file, in fasta format.
duk=<file>          Write statistics in duk's format.
nzo=t               Only write statistics about ref sequences with nonzero hits.
overwrite=t         (ow) Grant permission to overwrite files.
showspeed=t         (ss) 'f' suppresses display of processing speed.
ziplevel=2          (zl) Compression level; 1 (min) through 9 (max).
fastawrap=80        Length of lines in fasta output.
qout=auto           Output quality offset: 33 (Sanger), 64, or auto.
statscolumns=3      (cols) Number of columns for stats output, 3 or 5.
                    5 includes base counts.
rename=f            Rename reads to indicate which sequences they matched.
refnames=f          Use names of reference files rather than scaffold IDs.
trd=f               Truncate read and ref names at the first whitespace.
ordered=f           Set to true to output reads in same order as input.

Histogram output parameters:

bhist=<file>        Base composition histogram by position.
qhist=<file>        Quality histogram by position.
aqhist=<file>       Histogram of average read quality.
bqhist=<file>       Quality histogram designed for box plots.
lhist=<file>        Read length histogram.
gchist=<file>       Read GC content histogram.
gcbins=100          Number gchist bins.  Set to 'auto' to use read length.

Histograms for sam files only (requires sam format 1.4 or higher):

ehist=<file>        Errors-per-read histogram.
qahist=<file>       Quality accuracy histogram of error rates versus quality 
                    score.
indelhist=<file>    Indel length histogram.
mhist=<file>        Histogram of match, sub, del, and ins rates by read location.
idhist=<file>       Histogram of read count versus percent identity.
idbins=100          Number idhist bins.  Set to 'auto' to use read length.

Processing parameters:

k=27                Kmer length used for finding contaminants.  Contaminants 
                    shorter than k will not be found.  k must be at least 1.
rcomp=t             Look for reverse-complements of kmers in addition to 
                    forward kmers.
maskmiddle=t        (mm) Treat the middle base of a kmer as a wildcard, to 
                    increase sensitivity in the presence of errors.
maxbadkmers=0       (mbk) Reads with more than this many contaminant kmers 
                    will be discarded.
hammingdistance=0   (hdist) Maximum Hamming distance for ref kmers (subs only).
                    Memory use is proportional to (3*K)^hdist.
qhdist=0            Hamming distance for query kmers; impacts speed, not memory.
editdistance=0      (edist) Maximum edit distance from ref kmers (subs 
                    and indels).  Memory use is proportional to (8*K)^edist.
hammingdistance2=0  (hdist2) Sets hdist for short kmers, when using mink.
qhdist2=0           Sets qhdist for short kmers, when using mink.
editdistance2=0     (edist2) Sets edist for short kmers, when using mink.
forbidn=f           (fn) Forbids matching of read kmers containing N.
                    By default, these will match a reference 'A' if 
                    hdist>0 or edist>0, to increase sensitivity.
removeifeitherbad=t (rieb) Paired reads get sent to 'outmatch' if either is 
                    match (or either is trimmed shorter than minlen).  
                    Set to false to require both.
findbestmatch=f     (fbm) If multiple matches, associate read with sequence 
                    sharing most kmers.  Reduces speed.
skipr1=f            Don't do kmer-based operations on read 1.
skipr2=f            Don't do kmer-based operations on read 2.

Speed and Memory parameters:

threads=auto        (t) Set number of threads to use; default is number of 
                    logical processors.
prealloc=f          Preallocate memory in table.  Allows faster table loading 
                    and more efficient memory usage, for a large reference.
monitor=f           Kill this process if it crashes.  monitor=600,0.01 would 
                    kill after 600 seconds under 1% usage.
minrskip=1          (mns) Force minimal skip interval when indexing reference 
                    kmers.  1 means use all, 2 means use every other kmer, etc.
maxrskip=99         (mxs) Restrict maximal skip interval when indexing 
                    reference kmers. Normally all are used for scaffolds<100kb, 
                    but with longer scaffolds, up to K-1 are skipped.
rskip=              Set both minrskip and maxrskip to the same value.
                    If not set, rskip will vary based on sequence length.
qskip=1             Skip query kmers to increase speed.  1 means use all.
speed=0             Ignore this fraction of kmer space (0-15 out of 16) in both
                    reads and reference.  Increases speed and reduces memory.
Note: Do not use more than one of 'speed', 'qskip', and 'rskip'.

Trimming/Filtering/Masking parameters:
Note - if neither ktrim nor kmask is set, the default behavior is kfilter.
All three are mutually exclusive.

ktrim=f             Trim reads to remove bases matching reference kmers.
                    Values: 
                            f (don't trim), 
                            r (trim to the right), 
                            l (trim to the left)
kmask=f             Replace bases matching ref kmers with another symbol.
                    Allows any non-whitespace character other than t or f,
                    and processes short kmers on both ends.  'kmask=lc' will
                    convert masked bases to lowercase.
mink=0              Look for shorter kmers at read tips down to this length, 
                    when k-trimming or masking.  0 means disabled.  Enabling
                    this will disable maskmiddle.
qtrim=f             Trim read ends to remove bases with quality below trimq.
                    Performed AFTER looking for kmers.
                    Values: 
                            rl (trim both ends), 
                            f (neither end), 
                            r (right end only), 
                            l (left end only),
                            w (sliding window).
trimq=6             Regions with average quality BELOW this will be trimmed.
minlength=10        (ml) Reads shorter than this after trimming will be 
                    discarded.  Pairs will be discarded if both are shorter.
maxlength=          Reads longer than this after trimming will be discarded.
                    Pairs will be discarded only if both are longer.
minavgquality=0     (maq) Reads with average quality (after trimming) below 
                    this will be discarded.
maqb=0              If positive, calculate maq from this many initial bases.
chastityfilter=f    (cf) Discard reads with id containing ' 1:Y:' or ' 2:Y:'.
barcodefilter=f     Remove reads with unexpected barcodes if barcodes is set,
                    or barcodes containing 'N' otherwise.  A barcode must be
                    the last part of the read header.
barcodes=           Comma-delimited list of barcodes or files of barcodes.
maxns=-1            If non-negative, reads with more Ns than this 
                    (after trimming) will be discarded.
otm=f               (outputtrimmedtomatch) Output reads trimmed to shorter 
                    than minlength to outm rather than discarding.
tp=0                (trimpad) Trim this much extra around matching kmers.
tbo=f               (trimbyoverlap) Trim adapters based on where paired 
                    reads overlap.
minoverlap=24       Require this many bases of overlap for overlap detection.
mininsert=25        Require insert size of at least this much for overlap 
                    detection (will automatically set minoverlap too).
tpe=f               (trimpairsevenly) When kmer right-trimming, trim both 
                    reads to the minimum length of either.
forcetrimleft=0     (ftl) If positive, trim bases to the left of this position
                    (exclusive, 0-based).
forcetrimright=0    (ftr) If positive, trim bases to the right of this position
                    (exclusive, 0-based).
forcetrimright2=0   (ftr2) If positive, trim this many bases on the right end.
forcetrimmod=0      (ftm) If positive, right-trim length to be equal to zero,
                    modulo this number.
restrictleft=0      If positive, only look for kmer matches in the 
                    leftmost X bases.
restrictright=0     If positive, only look for kmer matches in the 
                    rightmost X bases.

Entropy/Complexity parameters:

entropy=-1          Set between 0 and 1 to filter reads with entropy below
                    that value.  Higher is more stringent.
entropywindow=50    Calculate entropy using a sliding window of this length.
entropyk=5          Calculate entropy using kmers of this length.
minbasefrequency=0  Discard reads with a minimum base frequency below this.

Java Parameters:

-Xmx                This will be passed to Java to set memory usage, overriding 
                    the program's automatic memory detection. -Xmx20g will 
                    specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  
                    The max is typically 85% of physical memory.

There is a changelog at /bbmap/docs/changelog_bbduk.txt
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
Thanks, Matt
M4TTN is offline   Reply With Quote
Old 03-26-2015, 03:58 PM   #35
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Matt,

If you already used "mink=1", then that's already quite aggressive! But you could still benefit from dropping from "k=25" to "k=23" and increasing from "hdist=1" to "hdist=2". On reflection, though, I don't recommend using the "tpe" flag when reads are different lengths, as in your case - it forces both reads to be trimmed to the same length, and particularly with mink=1 and hdist=1 that's not good: all of your 350bp reads would get trimmed to 250bp! Also, I wrote 32 when I meant 23 above, so I've corrected it - sorry for the confusion!

hdist=2: Allow up to 2 mismatches in the full-size kmers (23-mers).
hdist2=1: Allow up to 1 mismatch in the short kmers used at the end of the read (1-mers through 22-mers).

So overall, because you still have adapters, I would suggest:

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=1 hdist=2 hdist2=1 tbo

I don't normally think there's much point in dropping mink below about 6 unless having 5bp of adapter sequence at the end of your read will cause problems, but with 350bp reads, the end is probably going to be pretty low quality and losing a couple bases shouldn't matter.
Brian Bushnell is offline   Reply With Quote
Old 03-27-2015, 03:10 AM   #36
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Thanks Brian,

I've just reprocessed the fastq files with the modified bbduk parameters that you suggest:

ktrim=r k=23 mink=1 hdist=2 hdist2=1 tbo

Then aligned with Ugene Bowtie2, and opened the SAM file in SeqMonk to visualise "read length" histogram.

As you can see, it has made no difference at all. On the left are the previous parameters I was using; on the right, your suggested parameters.

https://www.dropbox.com/s/u44blbm2pg...right.png?dl=0

Do you have any other ideas that could help to improve the quality of the aligned reads? I am really surprised that reads of up to 350bp are not aligning - even if the sequencing error rate is high, I would have thought it very easy to locate the correct genomic region.

I am using the nextera.fa file as the source of adapter sequences (we are using NexteraXT for sample preparation).

Do you think it would be better to arbitrarily clip 25 bp off the 3' end of every read that is shorter than the read length (350bp for Read 1 and 250bp for Read 2)?

Perhaps it would be simpler to remove 25bp off the 3' end of every read? Even reads without adapter contamination would be clipped, but it should at least remove all adapter sequences from those reads that run into the adapter region.

This should not impact the insert length histogram, since that is calculated after mapping and is based on the 5' ends of the paired reads (I assume, anyway!).

Which parameters in BBDuk can be used to trim an arbitrary number of bp off the 3' ends?
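Going by the help text quoted above, forcetrimright2 (ftr2) does exactly this. A sketch, with placeholder file names:

```shell
# Sketch: unconditionally remove the last 25 bases from the 3' (right) end
# of every read, regardless of content. File names are placeholders.
./bbduk.sh in1=R1.fastq in2=R2.fastq out1=R1.clip.fastq out2=R2.clip.fastq \
    forcetrimright2=25
```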
M4TTN is offline   Reply With Quote
Old 03-27-2015, 04:02 AM   #37
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,089
Default

That is odd that you are not able to align to a reference. What genome is this?
GenoMax is offline   Reply With Quote
Old 03-27-2015, 04:06 AM   #38
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Perhaps I misspoke. I don't know that they are not aligning. Indeed, most of the sequences do align, otherwise we wouldn't get a SAM file out. My point was that I would find it odd if low sequence quality and/or residual adapters made much difference to the mapping ability of reads that are up to 350bp.

Does that make sense or not?

Is there a simple way to determine how many reads did not map in the SAM file?
M4TTN is offline   Reply With Quote
Old 03-27-2015, 05:21 AM   #39
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,089
Default

Quote:
Originally Posted by M4TTN View Post
Perhaps I mispoke. I don't know that they are not aligning. Indeed, most of the sequences do align otherwise we wouldn't get a SAM file out. My point was that I would find it odd that low sequence quality and/or residual adapters would make much difference to the mapping ability with reads that are up to 350bp.

Does that make sense or not?
It could make a difference depending on what exact parameters you are using and how different the read is compared to the reference. bowtie2 should soft-clip your reads if you have some contaminants, but I have seen people post that this may not work as well as pre-clipped reads.

Since you have BBMap installed you could easily run BBMap alignment and cross-check your results with bowtie2.

Quote:
Originally Posted by M4TTN View Post
Is there a simple way to determine how many reads did not map in the SAM file?
A quick hack would be

Code:
$ cat your_file.sam |  awk -F'\t' '{if ($3 == "*") print $0}' | wc -l
or you can use samtools

Code:
$ samtools view -f 0x4 your_file.sam
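A more robust variant tests bit 0x4 of the FLAG field (column 2) directly rather than checking RNAME for "*", since an unmapped mate can still carry its partner's coordinates in column 3. A self-contained sketch over a hypothetical two-record SAM stream (portable POSIX awk):

```shell
# Count unmapped reads by testing FLAG bit 0x4 (column 2).
# The two inline records are hypothetical: r1 is unmapped (flag 4),
# r2 is a mapped, properly-paired read (flag 99).
printf 'r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr2\t99\tchr1\t100\t60\t4M\t=\t150\t54\tACGT\tIIII\n' |
  awk -F'\t' '!/^@/ && int($2/4)%2==1 {n++} END {print n+0}'
# prints 1
```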
GenoMax is offline   Reply With Quote
Old 03-27-2015, 05:44 AM   #40
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

I just used SAMtools and it spat out a huge file. I copied it into Excel (for lack of a better method!) and counted the rows: 103,163.

Then I realised that I don't know the total number of reads...

Interestingly, SeqMonk opens the SAM file and says there are 387,481.

But if I open the file in Qualimap, it states 923,010 (with 103,136 stated as unmapped).

This seems about right since we multiplexed 24 samples and have about 20M clusters in total.

So Qualimap appears to agree with Samtools.

BUT: the read/insert histogram in Qualimap is broken:

See image: https://www.dropbox.com/s/vj9ruturaj...20OM7.png?dl=0

It has the error where fully overlapping paired reads (I think) are recorded as length=0.

To my mind there is a problem with the SAM specification for when reads fully overlap.

Last edited by M4TTN; 03-27-2015 at 05:47 AM.
M4TTN is offline   Reply With Quote