  • #31
    Thanks GenoMax and Brian. The community support here is excellent.

I've not had a chance to rerun the analyses today, but when we do, I'll report back.



    • #32
      Hello again...

      I've had some time to get back to playing with bbmerge/duk etc with some new datasets created on the MiSeq, and I am hoping for some input/explanation.

We are trying to push the insert size a little to maximise the value of 600bp sequencing. Previously we were getting average inserts below 300bp.

In the last run, we used fewer Ampure beads, and also did a titration on a set of samples with increasing amounts of input DNA into the NexteraXT reaction. We then sequenced with 1x350 + 1x250bp.

      We used Bowtie2 to align, then opened the SAM files in SeqMonk in order to plot the read/insert length histogram.

For unknown reasons, SeqMonk displays a really odd distribution for the new file, with very few reads below 350 bp in length. (See first image, left.)

We are pretty sure this is just an error in reading the SAM alignment file: we have seen similar problems in other analysis tools, where fully overlapping read pairs get a negative TLEN value and consequently are not plotted on the histogram. Previously, however, SeqMonk was one of the programs that did not have this issue (at least with 2x300bp sequencing).

Now, to try to solve the problem, and to improve the data fed to Bowtie2, I used bbduk to trim the fastq files using the following command:

      ./bbduk.sh -Xmx1g in1=/Desktop/TRS10B248_S27_L001_R1_001.fastq in2=/Desktop/TRS10B248_S27_L001_R2_001.fastq out1=/Desktop/TRS10B248_R1.fastq out2=/Desktop/TRS10B248_R2.fastq tbo ktrim=r ref=/Downloads/bbmap/resources/nextera.fa.gz k=25 mink=1 hdist=1

      Then did Bowtie2 alignment in Ugene and opened the SAM file in SeqMonk.

      The resulting histogram is much better, but still has lots of weird peaks in it. (See first image, right)

      Finally, we found that using input DNA of 1ng vs input DNA of 12ng had almost no discernible effect on the insert size distribution (see second image).

Can anyone help to explain what causes programs to:

      1. Misinterpret the SAM file insert lengths?
      2. Create the weird peaks seen in the second histogram?

      Links to images here:





Note: I have also used bbmerge to get insert lengths, but it only detects overlaps in about 30% of read pairs in the new samples (versus about 70% in the previous samples, which were purified using more Ampure beads), so bbmerge is unable to calculate the lengths of the rest. However, for the reads it does merge, the distribution is fairly smooth and continuous, without any weird peaks (see last image).
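For reference, the merge-based histogram came from a command along these lines (file names here are placeholders for our actual files; ihist writes the insert-size histogram of successfully merged pairs):

```shell
bbmerge.sh in1=TRS10B248_R1.fastq in2=TRS10B248_R2.fastq ihist=ihist.txt
```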

      Last edited by M4TTN; 03-26-2015, 08:41 AM.



      • #33
I just posted a new version of BBMerge (in BBMap 34.72) that has been updated to achieve a substantially higher merge rate for very long overlaps with high error rates. For extremely long reads (like 350bp), or in fact whenever you are using it to generate an insert size histogram, I recommend you use the new version and run with the "vloose" flag.

Insert sizes from overlapping will, of course, drop to zero before you get to the sum of the read lengths. But insert sizes by mapping are also inaccurate for another reason, and may drop to zero once the insert size drops below the length of your longer read, because the presence of adapter sequence will prevent reads from mapping (unless you allow clipped local alignments). There are also some programs that calculate insert size incorrectly for reads with insert size shorter than read length.

Because you have a huge spike at around 300bp, I believe what you're seeing is probably both of these combined: bowtie2's mapping rate is rapidly dropping once insert size goes below 350bp due to increasing adapter sequence, and SeqMonk is incorrectly calculating the insert size of reads that are mapped with insert sizes shorter than read length; those two things will combine to give a false peak at right around read length. And that's why the insert size histogram looks better after trimming - but there's still a peak present, presumably because BBDuk missed some of the adapters.

If you did not already do this, I encourage you to do the adapter trimming with these flags, to be more thorough:

        bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=2 hdist2=1 tbo

        That should get virtually all of the adapter sequence.

        I don't have any ideas about the spikes between 500bp and 600bp, or why they show up in the mapping graph but not the overlap graph, unless they are due to a misassembly in the reference.
        Last edited by Brian Bushnell; 03-26-2015, 02:55 PM.



        • #34
          Thanks Brian.

Will those options work better than what I already used?

          tbo ktrim=r k=25 mink=1 hdist=1

Are you able to explain what hdist=2 and hdist2=1 do?

          I've downloaded the new package, but I cannot see a description of these options listed when I type ./bbduk.sh in terminal:

          Code:
          Written by Brian Bushnell
          Last modified March 22, 2015
          
          Description:  Compares reads to the kmers in a reference dataset, optionally 
          allowing an edit distance. Splits the reads into two outputs - those that 
          match the reference, and those that don't. Can also trim (remove) the matching 
          parts of the reads rather than binning the reads.
          
          Usage:  bbduk.sh in=<input file> out=<output file> ref=<contaminant files>
          
          Input may be stdin or a fasta, fastq, or sam file, compressed or uncompressed.
          Output may be stdout or a file.
          If you pipe via stdin/stdout, please include the file type; e.g. for gzipped 
          fasta input, set in=stdin.fa.gz
          
          Optional parameters (and their defaults)
          
          Input parameters:
          
          in=<file>           Main input. in=stdin.fq will pipe from stdin.
          in2=<file>          Input for 2nd read of pairs in a different file.
          ref=<file,file>     Comma-delimited list of reference files.
          literal=<seq,seq>   Comma-delimited list of literal reference sequences.
          touppercase=f       (tuc) Change all bases upper-case.
          interleaved=auto    (int) t/f overrides interleaved autodetection.
          qin=auto            Input quality offset: 33 (Sanger), 64, or auto.
          reads=-1            If positive, quit after processing X reads or pairs.
          copyundefined=f     (cu) Process non-AGCT IUPAC reference bases by making all
                              possible unambiguous copies.  Intended for short motifs
                              or adapter barcodes, as time/memory use is exponential.
          
          Output parameters:
          
          out=<file>          (outnonmatch) Write reads here that do not contain 
                              kmers matching the database.  'out=stdout.fq' will pipe 
                              to standard out.
          out2=<file>         (outnonmatch2) Use this to write 2nd read of pairs to a 
                              different file.
          outm=<file>         (outmatch) Write reads here that contain kmers matching
                              the database.
          outm2=<file>        (outmatch2) Use this to write 2nd read of pairs to a 
                              different file.
          outs=<file>         (outsingle) Use this to write singleton reads whose mate 
                              was trimmed shorter than minlen.
          stats=<file>        Write statistics about which contamininants were detected.
          refstats=<file>     Write statistics on a per-reference-file basis.
          rpkm=<file>         Write RPKM for each reference sequence (for RNA-seq).
          dump=<file>         Dump kmer tables to a file, in fasta format.
          duk=<file>          Write statistics in duk's format.
          nzo=t               Only write statistics about ref sequences with nonzero hits.
          overwrite=t         (ow) Grant permission to overwrite files.
          showspeed=t         (ss) 'f' suppresses display of processing speed.
          ziplevel=2          (zl) Compression level; 1 (min) through 9 (max).
          fastawrap=80        Length of lines in fasta output.
          qout=auto           Output quality offset: 33 (Sanger), 64, or auto.
          statscolumns=3      (cols) Number of columns for stats output, 3 or 5.
                              5 includes base counts.
          rename=f            Rename reads to indicate which sequences they matched.
          refnames=f          Use names of reference files rather than scaffold IDs.
          trd=f               Truncate read and ref names at the first whitespace.
          ordered=f           Set to true to output reads in same order as input.
          
          Histogram output parameters:
          
          bhist=<file>        Base composition histogram by position.
          qhist=<file>        Quality histogram by position.
          aqhist=<file>       Histogram of average read quality.
          bqhist=<file>       Quality histogram designed for box plots.
          lhist=<file>        Read length histogram.
          gchist=<file>       Read GC content histogram.
          gcbins=100          Number gchist bins.  Set to 'auto' to use read length.
          
          Histograms for sam files only (requires sam format 1.4 or higher):
          
          ehist=<file>        Errors-per-read histogram.
          qahist=<file>       Quality accuracy histogram of error rates versus quality 
                              score.
          indelhist=<file>    Indel length histogram.
          mhist=<file>        Histogram of match, sub, del, and ins rates by read location.
          idhist=<file>       Histogram of read count versus percent identity.
          idbins=100          Number idhist bins.  Set to 'auto' to use read length.
          
          Processing parameters:
          
          k=27                Kmer length used for finding contaminants.  Contaminants 
                              shorter than k will not be found.  k must be at least 1.
          rcomp=t             Look for reverse-complements of kmers in addition to 
                              forward kmers.
          maskmiddle=t        (mm) Treat the middle base of a kmer as a wildcard, to 
                              increase sensitivity in the presence of errors.
          maxbadkmers=0       (mbk) Reads with more than this many contaminant kmers 
                              will be discarded.
          hammingdistance=0   (hdist) Maximum Hamming distance for ref kmers (subs only).
                              Memory use is proportional to (3*K)^hdist.
          qhdist=0            Hamming distance for query kmers; impacts speed, not memory.
          editdistance=0      (edist) Maximum edit distance from ref kmers (subs 
                              and indels).  Memory use is proportional to (8*K)^edist.
          hammingdistance2=0  (hdist2) Sets hdist for short kmers, when using mink.
          qhdist2=0           Sets qhdist for short kmers, when using mink.
          editdistance2=0     (edist2) Sets edist for short kmers, when using mink.
          forbidn=f           (fn) Forbids matching of read kmers containing N.
                              By default, these will match a reference 'A' if 
                              hdist>0 or edist>0, to increase sensitivity.
          removeifeitherbad=t (rieb) Paired reads get sent to 'outmatch' if either is 
                              match (or either is trimmed shorter than minlen).  
                              Set to false to require both.
          findbestmatch=f     (fbm) If multiple matches, associate read with sequence 
                              sharing most kmers.  Reduces speed.
          skipr1=f            Don't do kmer-based operations on read 1.
          skipr2=f            Don't do kmer-based operations on read 2.
          
          Speed and Memory parameters:
          
          threads=auto        (t) Set number of threads to use; default is number of 
                              logical processors.
          prealloc=f          Preallocate memory in table.  Allows faster table loading 
                              and more efficient memory usage, for a large reference.
          monitor=f           Kill this process if it crashes.  monitor=600,0.01 would 
                              kill after 600 seconds under 1% usage.
          minrskip=1          (mns) Force minimal skip interval when indexing reference 
                              kmers.  1 means use all, 2 means use every other kmer, etc.
          maxrskip=99         (mxs) Restrict maximal skip interval when indexing 
                              reference kmers. Normally all are used for scaffolds<100kb, 
                              but with longer scaffolds, up to K-1 are skipped.
          rskip=              Set both minrskip and maxrskip to the same value.
                              If not set, rskip will vary based on sequence length.
          qskip=1             Skip query kmers to increase speed.  1 means use all.
          speed=0             Ignore this fraction of kmer space (0-15 out of 16) in both
                              reads and reference.  Increases speed and reduces memory.
          Note: Do not use more than one of 'speed', 'qskip', and 'rskip'.
          
          Trimming/Filtering/Masking parameters:
          Note - if neither ktrim nor kmask is set, the default behavior is kfilter.
          All three are mutually exclusive.
          
          ktrim=f             Trim reads to remove bases matching reference kmers.
                              Values: 
                                      f (don't trim), 
                                      r (trim to the right), 
                                      l (trim to the left)
          kmask=f             Replace bases matching ref kmers with another symbol.
                              Allows any non-whitespace character other than t or f,
                              and processes short kmers on both ends.  'kmask=lc' will
                              convert masked bases to lowercase.
          mink=0              Look for shorter kmers at read tips down to this length, 
                              when k-trimming or masking.  0 means disabled.  Enabling
                              this will disable maskmiddle.
          qtrim=f             Trim read ends to remove bases with quality below trimq.
                              Performed AFTER looking for kmers.
                              Values: 
                                      rl (trim both ends), 
                                      f (neither end), 
                                      r (right end only), 
                                      l (left end only),
                                      w (sliding window).
          trimq=6             Regions with average quality BELOW this will be trimmed.
          minlength=10        (ml) Reads shorter than this after trimming will be 
                              discarded.  Pairs will be discarded if both are shorter.
          maxlength=          Reads longer than this after trimming will be discarded.
                              Pairs will be discarded only if both are longer.
          minavgquality=0     (maq) Reads with average quality (after trimming) below 
                              this will be discarded.
          maqb=0              If positive, calculate maq from this many initial bases.
          chastityfilter=f    (cf) Discard reads with id containing ' 1:Y:' or ' 2:Y:'.
          barcodefilter=f     Remove reads with unexpected barcodes if barcodes is set,
                              or barcodes containing 'N' otherwise.  A barcode must be
                              the last part of the read header.
          barcodes=           Comma-delimited list of barcodes or files of barcodes.
          maxns=-1            If non-negative, reads with more Ns than this 
                              (after trimming) will be discarded.
          otm=f               (outputtrimmedtomatch) Output reads trimmed to shorter 
                              than minlength to outm rather than discarding.
          tp=0                (trimpad) Trim this much extra around matching kmers.
          tbo=f               (trimbyoverlap) Trim adapters based on where paired 
                              reads overlap.
          minoverlap=24       Require this many bases of overlap for overlap detection.
          mininsert=25        Require insert size of at least this much for overlap 
                              detection (will automatically set minoverlap too).
          tpe=f               (trimpairsevenly) When kmer right-trimming, trim both 
                              reads to the minimum length of either.
          forcetrimleft=0     (ftl) If positive, trim bases to the left of this position
                              (exclusive, 0-based).
          forcetrimright=0    (ftr) If positive, trim bases to the right of this position
                              (exclusive, 0-based).
          forcetrimright2=0   (ftr2) If positive, trim this many bases on the right end.
          forcetrimmod=0      (ftm) If positive, right-trim length to be equal to zero,
                              modulo this number.
          restrictleft=0      If positive, only look for kmer matches in the 
                              leftmost X bases.
          restrictright=0     If positive, only look for kmer matches in the 
                              rightmost X bases.
          
          Entropy/Complexity parameters:
          
          entropy=-1          Set between 0 and 1 to filter reads with entropy below
                              that value.  Higher is more stringent.
          entropywindow=50    Calculate entropy using a sliding window of this length.
          entropyk=5          Calculate entropy using kmers of this length.
          minbasefrequency=0  Discard reads with a minimum base frequency below this.
          
          Java Parameters:
          
          -Xmx                This will be passed to Java to set memory usage, overriding 
                              the program's automatic memory detection. -Xmx20g will 
                              specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  
                              The max is typically 85% of physical memory.
          
          There is a changelog at /bbmap/docs/changelog_bbduk.txt
          Please contact Brian Bushnell at [email protected] if you encounter any problems.
          Thanks, Matt



          • #35
            Hi Matt,

If you already used "mink=1", then that's already quite aggressive! But you could still benefit from dropping from "k=25" to "k=23" and increasing from "hdist=1" to "hdist=2". On reflection, though, I don't recommend using the "tpe" flag when reads are different lengths, as in your case, since it forces both reads to be trimmed to the same length; particularly with mink=1 and hdist=1 that's not good - all of your 350bp reads would get trimmed to 250bp! Also, I wrote 32 when I meant 23 above, so I've corrected it - sorry for the confusion!

            hdist=2: Allow up to 2 mismatches in the full-size kmers (23-mers).
            hdist2=1: Allow up to 1 mismatch in the short kmers used at the end of the read (1-mers through 22-mers).

            So overall, because you still have adapters, I would suggest:

            bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=1 hdist=2 hdist2=1 tbo

            I don't normally think there's much point in dropping mink below about 6 unless having 5bp of adapter sequence at the end of your read will cause problems, but with 350bp reads, the end is probably going to be pretty low quality and losing a couple bases shouldn't matter.



            • #36
              Thanks Brian,

              I've just reprocessed the fastq files with the modified bbduk parameters that you suggest:

              ktrim=r k=23 mink=1 hdist=2 hdist2=1 tbo

Then aligned with Bowtie2 in Ugene, and opened the SAM file in SeqMonk to visualise the "read length" histogram.

As you can see, it has made no difference at all. On the left are the results with the previous parameters I was using; on the right, with your suggested parameters.



Do you have any other ideas that could help to improve the quality of the aligned reads? I am really surprised that reads of up to 350bp are not aligning - even if the sequencing error rate is high, I would have thought it very easy to locate the correct genomic region.

I am using the nextera.fa file as the source of adapter sequences (we are using NexteraXT for sample preparation).

Do you think it would be better to arbitrarily clip 25 bp off the 3' end of every read that is shorter than the full read length (350bp for Read 1 and 250bp for Read 2)?

Perhaps it would be simpler to remove 25bp off the 3' end of every read? Even reads without adapter contamination would then be clipped, but it should at least remove all adapter sequence from those reads that run into the adapter region.

This should not impact the insert length histogram, since that is calculated after mapping and is based on the 5' ends of the paired reads (I assume, anyway!).

Which parameters in bbduk can be used to trim an arbitrary number of bases off the 3' ends?
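From the parameter list above, forcetrimright2 (ftr2) looks like the relevant flag. Something along these lines should take a fixed 25 bp off the 3' end of every read - an untested sketch, with placeholder file names:

```shell
bbduk.sh in1=R1.fastq in2=R2.fastq out1=R1_clip.fastq out2=R2_clip.fastq ftr2=25
```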



              • #37
It is odd that you are not able to align to a reference. What genome is this?



                • #38
Perhaps I misspoke. I don't know that they are not aligning - indeed, most of the sequences do align, otherwise we wouldn't get a SAM file out. My point was that I would find it odd if low sequence quality and/or residual adapters made much difference to mappability for reads that are up to 350bp.

                  Does that make sense or not?

                  Is there a simple way to determine how many reads did not map in the SAM file?



                  • #39
Originally posted by M4TTN:
Perhaps I misspoke. I don't know that they are not aligning. Indeed, most of the sequences do align, otherwise we wouldn't get a SAM file out. My point was that I would find it odd that low sequence quality and/or residual adapters would make much difference to the mapping ability with reads that are up to 350bp.

                    Does that make sense or not?
It could make a difference depending on what exact parameters you are using and how different the read is compared to the reference. bowtie2 should soft-clip your reads if you have some contaminants, but I have seen people post that this may not work as well as pre-clipped reads.

                    Since you have BBMap installed you could easily run BBMap alignment and cross-check your results with bowtie2.

Originally posted by M4TTN:
                    Is there a simple way to determine how many reads did not map in the SAM file?
                    A quick hack would be

                    Code:
$ awk -F'\t' '$3 == "*"' your_file.sam | wc -l
                    or you can use samtools

                    Code:
$ samtools view -f 0x4 your_file.sam



                    • #40
I just used samtools and it spat out a huge file. I copied it into Excel (for lack of a better method!) and counted the rows: 103,163.

                      Then I realised that I don't know the total number of reads...

Interestingly, SeqMonk opens the SAM file and says there are 387,481 reads.

But if I open the file in Qualimap, it states 923,010 (with 103,136 stated as unmapped).

                      This seems about right since we multiplexed 24 samples and have about 20M clusters in total.

                      So Qualimap appears to agree with Samtools.
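Incidentally, for anyone else counting reads this way: samtools can summarise the mapped/unmapped totals directly, which avoids the Excel step entirely (assuming a reasonably recent samtools that reads SAM as well as BAM):

```shell
samtools flagstat your_file.sam
```

The "total" and "mapped" lines it prints should roughly match what Qualimap reports.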

                      BUT: the read/insert histogram in Qualimap is broken:

                      See image: https://www.dropbox.com/s/vj9ruturaj...20OM7.png?dl=0

It has the error whereby fully overlapping paired reads (I think) are recorded as length=0.

                      To my mind there is a problem with the SAM specification for when reads fully overlap.
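As a workaround on the analysis side, taking the absolute value of TLEN (SAM column 9) sidesteps the sign problem when plotting. A quick sketch with awk (assuming a standard-column SAM stream; this is my own hack, not part of any of the tools discussed):

```shell
# Histogram of |TLEN| (SAM column 9), skipping header lines and records
# with TLEN=0; abs() avoids losing overlapping pairs whose TLEN is negative.
tlen_hist() {
  awk -F'\t' '!/^@/ && $9 != 0 { t = $9; if (t < 0) t = -t; h[t]++ }
              END { for (s in h) print s, h[s] }' "$@" | sort -n
}
```

Usage would be e.g. `tlen_hist your_file.sam`; each output line is an insert size followed by a count. Note that both mates carry a TLEN, so each pair contributes twice.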
                      Last edited by M4TTN; 03-27-2015, 04:47 AM.



                      • #41
Do this to avoid the Excel part:

                        Code:
$ samtools view -f 0x4 your_file.sam | wc -l
                        That should just spit a number out.

                        If R1/R2 reads are overlapping then you should consider using BBmerge.sh (or FLASH) to get a collapsed representation of the fragment before doing the mapping.

                        Total number of reads in your original/trimmed file can be easily found by FastQC (detailed stats should also be in the output of the bbduk.sh log).



                        • #42
Originally posted by M4TTN:
                          I just used SAMTools and it spat out a huge file. Copied it into Excel (for lack of a better method!) and counted the rows: 103,163
                          But if I open the file in Qualimap, it states: 923,010 (with 103,136 stated as unmapped).
                          This is for one sample? That does not look too bad then. You have about 90% reads mapping with 10% unmapped.



                          • #43
                            Here are the stats from Qualimap:



If I am reading this correctly, the total mapped reads include those without a mapped mate (listed as singletons = 24,393).

What surprises me somewhat is why 10% are not mapping at all (neither Read 1 nor Read 2 mapping). Would quality-trimming the file help? Or would slicing the reads into smaller pieces help?

I am also really surprised by the mapped duplication rate of 24.57%. Does this literally mean that a quarter of our clusters were made of identical molecules? I am amazed, given that this is whole genome sequencing of Nextera-fragmented DNA.

                            Or is it due to the complete overlap of 25% of our reads? This seems more likely, so I hope it is that!



                            • #44
I just had a thought: presumably, if I arbitrarily trim 1-2 bases off the 3' end of EVERY read, there will be no pairs that fully overlap, and thus the TLEN column in the SAM file should no longer be broken/misinterpreted by Qualimap.

                              Would that work do you think?



                              • #45
Hmmm... a quick manual BLAST alignment of half a dozen of the unmapped sequences spat out by samtools reveals that they map to the yeast 2 micron plasmid!

If this is a representative sample, it suggests that we have about 200 copies of the 2 micron plasmid contaminating our yeast genomic DNA preps!

Going forward, we're going to try to add the 2 micron sequence to our reference file for mapping, so that we can at least monitor it correctly. But ideally, it would be good to rid our gDNA preps of the plasmid as much as possible, so I'll look into some protocols to see if it is possible to selectively partition the gDNA and the 2 micron plasmid.
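Adding the plasmid to the reference should just be a concatenation of FASTA files before rebuilding the aligner index (file names here are placeholders for our actual references):

```shell
cat yeast_genome.fa two_micron_plasmid.fa > yeast_plus_2micron.fa
```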

Finally, I repeated the alignment using BWA (versus Bowtie2). The number of aligned reads was much lower; however, it was very clear that the odd peaks around 250, 350, and 500 bp are not present in the BWA-aligned file. I assume this is due to Bowtie2's lower alignment stringency allowing reads with adapters/mismatches to align.

Bowtie2 is at the top, BWA at the bottom; both were aligned using UGene from bbduk-trimmed fastq files.



What I still don't really understand is why bbduk is not removing every trace of the adapters from the 3' ends of the reads. Thoughts?
                                Last edited by M4TTN; 03-27-2015, 08:16 AM.
