Unconfigured Ad

**Brian Bushnell** · 11-10-2015, 11:59 AM

I would quality-trim (discarding reads below a certain length after trimming, say, 70bp) or quality-filter. Then discard libraries that are below your coverage target after the operation. For example, using BBDuk on files named "read1.fq.gz" and "read2.fq.gz" and trimming to q15:

bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70

You'll get some result like this:

Code:

Input:                          200000 reads            60200000 bases.
QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
Result:                         191928 reads (95.96%)   40590306 bases (67.43%)

Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.

Alternately, if the only problem is the last 6 bases of read 2, then just trim the last 6 bases of read 2 and keep all of the libraries.

**GenoMax** · 11-10-2015, 12:54 PM

Remember to trim R1/R2 together even if you inspect/suspect only R2. @Brian's solution will ensure that is done.

Basically, what you have in hand is a 100 x 94 run.

**bio_informatics** · 11-10-2015, 12:55 PM

Originally posted by Brian Bushnell View Post

bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70

Code:

Input:                          200000 reads            60200000 bases.
QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
Result:                         191928 reads (95.96%)   40590306 bases (67.43%)

Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.

Hi Brian,
Thanks much for your reply, and suggesting bbduk. This is the exact thing I was looking for.

I'll give this a try.

**bio_informatics** · 11-10-2015, 01:00 PM

Originally posted by GenoMax View Post

Remember to trim R1/R2 together even if you inspect/suspect only R2. @Brian's solution will ensure that is done.

Basically, what you have in hand is a 100 x 94 run.

Hi GM,
Oh, I see.
I couldn't imagine to that extent. Shall include both reads.

Thank you.

**bio_informatics** · 11-11-2015, 11:25 AM

Originally posted by Brian Bushnell View Post

bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70

Hi Brian,
Thanks again for sharing tool, and command.
I've following output:

[in=R2_001.fastq.gz, out=trimmed.fq.gz, qtrim=r, trimq=25, minlen=70]

Input: 2301451 reads 195623335 bases.
QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
Result: 2043361 reads (88.79%) 172932784 bases (88.40%)

I've following doubt:

1) If I'm understanding output properly, here, 88.79% of the input reads passed my filters (Length 70, and Quality 25).
Kindly correct me if I'm wrong.

**GenoMax** · 11-11-2015, 03:36 PM

Here is what @Brian had to say about this question

Why Input =/= KTrimmed + Result at times?

There are two reasons for that, and they involve the "minlen" parameter and read pairing. For trimming, generally, the result should equal the input unless some reads are trimmed below the minimum length and discarded; you can, for example, have 100% trimmed and a result of 100% as long as they are only trimmed a little bit.

With pairs, it's even more confusing as sometimes you have a pair discarded because one read got trimmed too short, so you can end up with, theoretically, 50% of reads trimmed and a result of 0%.

If you set "minlen=0", then nothing will get discarded on the basis of being too short and result will equal input.

I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.

**bio_informatics** · 11-12-2015, 05:51 AM

Originally posted by GenoMax View Post

Here is what @Brian had to say about this question

I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.

Hi Genomax,
Sorry, I don't understand what it mean.

Let me take baby steps here. Input R2 - lagging strand.

Input: 2301451 reads 195623335 bases.
QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
Result: 2043361 reads (88.79%) 172932784 bases (88.40%)

Qtrimmed (reads) % + Result (reads) % >100

However,
Qtrimmed (bases) % + Result (bases) % =100

I'm confused here.
Were 16.11% of my reads discarded?
If so, then my output should have 83-ish% of result. Why there's a conflict?

References:

BBMap (aligner for DNA/RNAseq) is now open-source and available for download. - SEQanswers

http://seqanswers.com/forums/showthread.php?t=41057&highlight=bbduk

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

Yes .. BBMap can do that! - SEQanswers

http://seqanswers.com/forums/showthread.php?t=58221

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

**GenoMax** · 11-12-2015, 06:45 AM

Originally posted by bio_informatics View Post

Hi Genomax,
Qtrimmed (reads) % + Result (reads) % >100

However,
Qtrimmed (bases) % + Result (bases) % =100

I'm confused here.
Were 16.11% of my reads discarded?
If so, then my output should have 83-ish% of result. Why there's a conflict?

~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.

BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?

**bio_informatics** · 11-12-2015, 07:35 AM

Originally posted by GenoMax View Post

~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.

BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?

Hi GM,
Thanks for your reply.
That helps.

But, then why doesn't this apply to the bases? Base %-age seems to add up to 100.
Yes, I had posted output for Read 2 only.

I'm definitely going to use R1 for analysis.

If I include both my reads, with q25 and length 70, output looks like:

Processing time: 4.471 seconds.

Input: 2301450 reads 195623250 bases.
QTrimmed: 459677 reads (19.97%) 51656212 bases (26.41%)
Result: 1701048 reads (73.91%) 143967038 bases (73.59%)

Time: 4.911 seconds.
Reads Processed: 2301k 468.65k reads/sec
Bases Processed: 195m 39.84m bases/sec

The reason I was using R2 only is if Read 2's result (from bbduk) percentage is below 80% all those isolate will be discarded for the analysis.

**bio_informatics** · 11-12-2015, 07:53 AM

Since I'm interested to see how many reads/bases are good. Why would I trim them?
Can I use maq (minimum average quality) argument in bbduk? My goal is to get count for the reads which have quality say xyz.
Sorry, I'm making waters murky. I think it's important for me to understand what to use, when and why.

**GenoMax** · 11-12-2015, 08:20 AM

This is a re-sequencing project correct? Minq=25 seems pretty strict unless you have a specific reason. Do you have situations where there are areas in the middle of the read that are bad quality (based on your FastQC scans)?

It is not logical to calculate "average" quality for a read. Is that what you are referring to?

Looks like most of your R2 is ok (though slightly short since the run stopped after 94 cycles?)

**bio_informatics** · 11-12-2015, 08:31 AM

Originally posted by GenoMax View Post

This is a re-sequencing project correct? Minq=25 seems pretty strict unless you have a specific reason. Do you have situations where there are areas in the middle of the read that are bad quality (based on your FastQC scans)?

It is not logical to calculate "average" quality for a read. Is that what you are referring to?

Looks like most of your R2 is ok (though slightly short since the run stopped after 94 cycles?)

Hi,

- Yes, average read quality I referred to.
- Yes, the run stopped after 96 cycles.
- I pulled QC for Read 2, it looks good than I had expected. Nothing below quality 36. There are no regions where quality slacks in middle.

Yes, I'm interested to find out reads which are ugly and then can be sent for re-sequencing.

There's no reason to select Q25.
Q1) Is quality 25 while working with "Read 2" alone is high? Or
while working with both "R1 and R2" is high?

That is the issue, should I pull the bar for result percent below, that is select reads which have passed Q25, L 70? I'm considering to take reads which give 80% in result. (this is while running Read 2 only).

Or

Pull down Quality at which I'm obtaining my result which in this case 25.

Thanks again for your support, and guiding through these discussions.

**GenoMax** · 11-12-2015, 08:46 AM

Based on what you are describing this run seems fully acceptable. Just 6 (or 4 bp) shorter on R2 than you expected, which is no big deal (unless there are NNNNN's at the end of every read in R2, where the cycles failed).

I would say just scan/trim for presence of adapters (and if you must, trim <Q20 bases) and then on towards mapping.

**Brian Bushnell** · 11-12-2015, 10:55 AM

For trimming, Q25 is very high. The more bases you trim, the less accurate your mapping. Even low-quality bases (under Q10) can increase the specificity when mapping. I don't trim above ~Q15 except in very rare circumstances; Q12 is more typical.

To clarify a couple of things:

"maq", or minimum average quality, translates the quality scores to probabilities, averages them, then transforms back to a PHRED-scaled Q-score. So, maq=20 would remove all reads with an average error rate of over 1%, or an average of 1 error per 100bp read. This is applied after trimming, if trimming is performed. Ns are considered to have a 75% error rate, so it's useful to at least trim trailing Ns (trimq=1) if you use maq; otherwise, any read with 4 trailing Ns would be discarded. And its mate would also be discarded.

And - it's important to always trim (and otherwise process) R1 and R2 together, at the same time, in any process that might discard or reorder the reads. Otherwise you will end up with unmatched pairs that won't map correctly.

Lastly, note that you can trim the last 4 bases of read 2 only, and keep the result in the same order as before, like this:

bbduk.sh in=read2.fq out=trimmed2.fq ftr2=4 ordered

In this situation, it would be helpful if you could post a couple charts from one of the bad batches of raw data:

Either post the quality histogram according to Illumina:
bbduk.sh in=read1.fq in2=read2.fq qhist=qhist.txt bhist=bhist.txt

Or, even better, post the quality histograms as determined by mapping:
bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fasta qhist=qhist.txt bhist=bhist.txt mhist=mhist.txt

Topics	Statistics	Last Post
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, Today, 05:37 AM	0 responses 1 view 0 reactions	Last Post by SEQadmin2 Today, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 49 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 109 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM

Unconfigured Ad

Sequence Quality for large number of fastqs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News