Seqanswers Leaderboard Ad

**ysnapus** · 12-01-2014, 05:41 PM

Originally posted by Brian Bushnell View Post

Hmmm. Odd; I assume the base frequency drift is some kind of machine issue, and the quality scores of those bases (at the tail of read 2) are inaccurate. But I can't really explain the seeming discrepancy between the bhist and mhist; after quality-trimming, I would expect the base frequencies and mismatches to both flatten out, not just one of them. Unless the reads with the divergent base frequencies are not mapping because (for example) they are something synthetic rather than genomic, though synthetic contaminant molecules don't tend to be so long.

Perhaps I should generate the SAM file from bbmap and then generate the bhist from just the mapped reads in the SAM? Can a bbmap tool generate the bhist from mapped reads of a SAMfile?

Also, the mhist isn't completely flat, it slopes downwards at the ends. Though it looks flatter than the bhist.

Originally posted by Brian Bushnell View Post

Overall, I'm not certain what's going on, or how to fix it, short of forcibly trimming the reads to ~280bp and ~220bp, which may not be necessary or even beneficial.

I could do that, though I guess I'll have to manually set the forcetrimright and forcetrimleft for each file.

**Brian Bushnell** · 12-01-2014, 06:00 PM

Originally posted by ysnapus View Post

Perhaps I should generate the SAM file from bbmap and then generate the bhist from just the mapped reads in the SAM? Can a bbmap tool generate the bhist from mapped reads of a SAMfile?

BBMap can output mapped reads in fastq, which will make it a little easier to keep the pairs straight, then you can calculate the bhist with reformat:

bbmap.sh in=reads.fq outm=mapped.fq outu=unmapped.fq po

You can use the "outm2" and "outu2" flags if the reads are in two files.
The "po" (pairedonly) flag will make reads that did not map paired go to the "outu" stream.

Also, the mhist isn't completely flat, it slopes downwards at the ends. Though it looks flatter than the bhist.

Exactly - I would expect the tail of the mhist plot to deflect at least as much as the bhist plot, if the bases with divergent frequencies are being mapped and are incorrect.

I could do that, though I guess I'll have to manually set the forcetrimright and forcetrimleft for each file.

Depending on your goal, it may not be necessary, even though the graphs look ugly. Most programs can tolerate some degree of error; trimming becomes more important when that error is biased such that it can change a quantitative result, or the amount of error makes you run out of memory, etc. Often the best approach is to try both ways and see if it made a positive impact, though obviously, that's labor-intensive.

**tkc.yang** · 12-03-2014, 11:34 AM

error in trimming

Hi, I recently got RNA-seq fastq files generated from Ion Torrent.

I tried using the BBDuk program to trim the reads on quality, but I ran into this error and I don't know what I did wrong. I'm also very unfamiliar with coding in general.

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in=.\reads.fastq out=Q10_reads.fastq qtrim=rl trimq=10 minlen=20

Exception in thread "main" java.lang.NullPointerException
at jgi.BBDukF.spawnProcessThreads(BBDukF.java:1414)
at jgi.BBDukF.process2(BBDukF.java:780)
at jgi.BBDukF.process(BBDukF.java:716)
at jgi.BBDukF.main(BBDukF.java:65)

Thanks

**Brian Bushnell** · 12-03-2014, 11:41 AM

What version do you have? The current version does not even have anything on line 1414. If you download the current version it should work.

An even easier solution is to use reformat instead, which has the same syntax, since you are not doing any kmer-based operations:

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.ReformatReads in=.\reads.fastq out=Q10_reads.fastq qtrim=rl trimq=10 minlen=20

**tkc.yang** · 12-03-2014, 02:10 PM

I have version 33.92. I'll download the newest version.

But the Reformat worked! Thanks.

**sdriscoll** · 12-09-2014, 03:21 PM

Brian,

Do you foresee making it possible to use kmers longer than 31bp at any point? Just curious.

Also could you add the 'trd' parameter to seal so that when one enables the 'rename' option one gets those nicely trimmed reference names as in bbmap?

Thanks!

**Brian Bushnell** · 12-09-2014, 03:48 PM

I am considering making an assembler based on BBDuk/Seal's data structures. When I do that I will probably support k>31, which will probably percolate back to BBDuk/Seal, but that won't be anytime soon.

BBDuk does allow processing of k>31 in emulated mode when filtering; specifically, if you set k=35, for example, it would require 5 consecutive 31-mer hits to trigger a match. It's not quite the same, but it does greatly increase specificity. Seal lacks this ability because it would be kind of difficult to track the number of consecutive matches for a kmer hit to an arbitrary number of different references simultaneously.

Seal and BBDuk - in the latest version, 34.08 - already support the trd flag, it's just undocumented. I will document it. Between 34.00 and 34.08 I unified a lot of the parsing so there are a lot of BBMap flags that are suddenly now supported by other tools.

**sdriscoll** · 12-16-2014, 11:54 AM

Hi Brian,

I'm using bbduk to trim adapters from a microRNA seq run and in this case I only want to keep reads that were successfully trimmed. My command line is something like this:

Code:

 bbduk.sh in=reads.fastq.gz out=trimmed.fq ref=mirna_adapter.fa ktrim=r k=20 mink=12 hdist=1 minlength=15

bbduk finishes successfully and reports:

Code:

Input:                  	22241069 reads 		1134294519 bases.
KTrimmed:               	16273134 reads (73.17%) 	401433668 bases (35.39%)
Result:                 	21270190 reads (95.63%) 	732860851 bases (64.61%)

If I count the number of reads in my output file, trimmed.fq, I get 21270190, however according to bbduk only 16273134 reads were trimmed. What is going on here?

**Brian Bushnell** · 12-16-2014, 12:36 PM

22241069 reads were processed. Of those, 16273134 had adapter sequence detected, so they were trimmed. The output was 21270190 reads, which includes both trimmed reads (which had adapter detected) and untrimmed reads (which didn't). (22241069 - 21270190) = 970879 reads were discarded because after trimming, they were too short to use (default 10bp).

And that accounts for all the reads. Does that answer your question?

**sdriscoll** · 12-16-2014, 12:37 PM

Yes. What option should I used so that I can get a FASTQ file with ONLY the reads that were trimmed?

**Brian Bushnell** · 12-16-2014, 01:09 PM

Originally posted by sdriscoll View Post

Yes. What option should I used so that I can get a FASTQ file with ONLY the reads that were trimmed?

Oh, oops, I didn't read your question carefully enough. Assuming the reads were the same length going in, you can subsequently filter the output file to retain only reads shorter than the original length:

bbduk.sh in=trimmed.fq outm=filtered.fq minlength=100

You could also use the minlength and outm flags on your original command line instead. "outm" is a little confusing but it gathers all the reads that 'fail', meaning they either match something (presumed to be a contaminant), OR after trimming are shorter than minlength. "outu" (aka "out") gets everything else.

If your input reads were variable-length then this approach would not work, but you could still do it like this:

Code:

bbduk.sh in=reads.fastq.gz [COLOR="DarkRed"][B]mlf=1.0 outm[/B][/COLOR]=trimmed.fq ref=mirna_adapter.fa ktrim=r k=20 mink=12 hdist=1 minlength=15

...where "mlf" or "minlengthfraction" means that reads will fail if they are trimmed to less than that fraction of their original length.

**Brian Bushnell** · 12-16-2014, 01:17 PM

Now that I think about it, I guess the problem is that it is not clear what "out" and "outm" do for BBDuk.

out: Accepts all reads that pass criteria. If BBDuk is used for kmer filtering, that means all reads that do not have kmer matches to the reference. If BBDuk is used for trimming, that means all reads that, after trimming, are longer than a minimum length, whether or not they had any matching kmers.

outm: Accepts all reads that fail criteria and thus would not go to "out". "m" stands for "match", and when kmer filtering it accepts reads with matching kmers, but it also gets reads shorter than minlen (again, the default is 10, so you'd have to set it to 0 to disable this behavior).

**sdriscoll** · 12-16-2014, 01:34 PM

OK, thanks that's useful. I guess logically if I'm also specifying mink=12 and I only want reads that were trimmed I could also use maxlength=[read_length - mink] and then out would contain only reads reads that were trimmed. I didn't notice the mlf option...that should be handy as well.

**sdriscoll** · 12-16-2014, 02:11 PM

For seal, would there be any logical way to add in number of unique hits to the rpkm= output? I'd also be interested to know what it does when a read has equally good matches to two or more target references.

Thanks!

**Brian Bushnell** · 12-16-2014, 02:41 PM

By unique, do you mean "only matching that sequence" or "occurring in a unique position in that sequence"? The latter cannot be done, but the former can be controlled with the "ambig" flag that I forgot to document. It is similar to BBMap's ambig flag. This is the description (which will be present in the next release):

Code:

ambiguous=random    (ambig) Set behavior on ambiguously-mapped reads (with an
                    equal number of kmer matches to multiple sequences).
                         first:  Use the first best-matching sequence.
                         toss:   Consider unmapped.
                         random: Select one best-matching sequence randomly.
                         all:    Use all best-matching sequences.

So, the default is random. With BBMap the default is "first".

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 17 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 46 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News