SEQanswers




Old 11-10-2015, 11:45 AM   #1
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Sequence Quality for a large number of fastqs

Hi Members,

Organism: S. aureus
I have Illumina WGS data for 100 isolates, i.e., 2 × 100 fastq.gz files, read length 100.
Unfortunately, the run failed before completing the last 6 cycles.
The sequencing centre has de-multiplexed the data, and the ball is in our court now.

My challenge here is to strike out isolates with poor-quality data. I'm interested in R2 only.

One approach I know of:
Get a FastQC report for every R2, and visually decide pass/fail for each dataset one by one.
But that would be qualitative, so I don't want to take that approach (for the time being).
Also, "poor" is a subjective term.

Let's say I'd like to keep isolates with an average coverage of at least 25×. Is there a way to do so?
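As a quick sanity check, that 25× goal can be expressed as a plain base-count threshold per library. This is only a sketch; the ~2.8 Mbp S. aureus genome size is an assumed round figure, so adjust it for your strain:

```python
# Sketch: turn a coverage target into a minimum base count per library.
# ASSUMPTION: ~2.8 Mbp S. aureus genome; adjust for your strain.
GENOME_SIZE = 2_800_000  # bp
TARGET_COVERAGE = 25     # desired average depth

def required_bases(genome_size=GENOME_SIZE, coverage=TARGET_COVERAGE):
    """Minimum sequenced bases needed to reach the coverage target."""
    return genome_size * coverage

def estimated_coverage(bases, genome_size=GENOME_SIZE):
    """Approximate average coverage implied by a library's base count."""
    return bases / genome_size

print(required_bases())  # 70,000,000 bases for 25x over 2.8 Mbp
```

Any library whose surviving base count (after trimming/filtering) falls below this number would miss the 25× target.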

I'm sure this situation is common in other labs too.

Looking forward to your guidance!

Last edited by bio_informatics; 11-12-2015 at 07:55 AM.
Old 11-10-2015, 11:59 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I would quality-trim (discarding reads below a certain length after trimming, say, 70bp) or quality-filter. Then discard libraries that are below your coverage target after the operation. For example, using BBDuk on files named "read1.fq.gz" and "read2.fq.gz" and trimming to q15:

bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70

You'll get some result like this:

Code:
Input:                          200000 reads            60200000 bases.
QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
Result:                         191928 reads (95.96%)   40590306 bases (67.43%)
Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.
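With 100 libraries, that decision can be scripted. Below is a hypothetical helper (not part of BBTools) that parses the "Result:" line from a captured BBDuk summary and applies a base-count cutoff; the 70 Mbp default threshold is an assumption (25× over a ~2.8 Mbp genome), not a BBDuk setting:

```python
import re

# Parse the "Result:" line of a BBDuk summary and apply a base-count cutoff.
# The regex matches the summary layout shown above.
RESULT_RE = re.compile(r"Result:\s+(\d+)\s+reads\s+\(([\d.]+)%\)\s+(\d+)\s+bases")

def parse_result_line(log_text):
    """Return (reads, read_pct, bases) from a BBDuk summary, or None."""
    m = RESULT_RE.search(log_text)
    if not m:
        return None
    return int(m.group(1)), float(m.group(2)), int(m.group(3))

def library_passes(log_text, min_bases=70_000_000):
    """True if the surviving base count clears the threshold."""
    parsed = parse_result_line(log_text)
    return parsed is not None and parsed[2] >= min_bases

log = "Result:    191928 reads (95.96%)   40590306 bases (67.43%)"
print(parse_result_line(log))   # (191928, 95.96, 40590306)
print(library_passes(log))      # False: 40.6 Mbp is below 70 Mbp
```

Running BBDuk per isolate in a loop and feeding each captured summary through `library_passes` gives an objective pass/fail list instead of one-by-one visual inspection.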

Alternatively, if the only problem is the last 6 bases of read 2, just trim those 6 bases and keep all of the libraries.
Old 11-10-2015, 12:54 PM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,815

Remember to trim R1/R2 together even if you inspect/suspect only R2. @Brian's solution will ensure that is done.

Basically, what you have in hand is a 100 x 94 run.
Old 11-10-2015, 12:55 PM   #4
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182

Quote:
Originally Posted by Brian Bushnell View Post

bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70


Code:
Input:                          200000 reads            60200000 bases.
QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
Result:                         191928 reads (95.96%)   40590306 bases (67.43%)
Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.
Hi Brian,
Thanks very much for your reply and for suggesting BBDuk. This is exactly what I was looking for.
I'll give this a try.
Old 11-10-2015, 01:00 PM   #5
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182

Quote:
Originally Posted by GenoMax View Post
Remember to trim R1/R2 together even if you inspect/suspect only R2. @Brian's solution will ensure that is done.

Basically, what you have in hand is a 100 x 94 run.
Hi GM,
Oh, I see.
I hadn't thought of that. I'll include both reads.

Thank you.
Old 11-11-2015, 11:25 AM   #6
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182

Quote:
Originally Posted by Brian Bushnell View Post
bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70
Hi Brian,
Thanks again for sharing the tool and the command.
I have the following output:

Quote:
[in=R2_001.fastq.gz, out=trimmed.fq.gz, qtrim=r, trimq=25, minlen=70]

Input: 2301451 reads 195623335 bases.
QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
Result: 2043361 reads (88.79%) 172932784 bases (88.40%)
I have the following doubt:

1) If I'm understanding the output correctly, 88.79% of the input reads passed my filters (length 70 and quality 25).
Kindly correct me if I'm wrong.

Last edited by bio_informatics; 11-11-2015 at 11:32 AM.
Old 11-11-2015, 03:36 PM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,815

Here is what @Brian had to say about this question

Quote:
Why doesn't Input equal KTrimmed + Result at times?

There are two reasons for that, and they involve the "minlen" parameter and read pairing. For trimming, generally, the result should equal the input unless some reads are trimmed below the minimum length and discarded; you can, for example, have 100% trimmed and a result of 100% as long as they are only trimmed a little bit.

With pairs, it's even more confusing as sometimes you have a pair discarded because one read got trimmed too short, so you can end up with, theoretically, 50% of reads trimmed and a result of 0%.

If you set "minlen=0", then nothing will get discarded on the basis of being too short and result will equal input.
I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.
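The accounting Brian describes can be illustrated with a toy simulation. This is not BBDuk's code, just made-up read lengths showing why the read percentages can sum to more than 100% while the base percentages sum to exactly 100%:

```python
# Toy illustration: a trimmed read that stays at or above minlen is counted
# in BOTH the "QTrimmed" and "Result" read tallies, so those percentages can
# exceed 100% combined. Bases, in contrast, either survive or are removed,
# so the base percentages always sum to 100%.
def summarize(original_lengths, trimmed_lengths, minlen=10):
    n = len(original_lengths)
    trimmed = sum(1 for a, b in zip(original_lengths, trimmed_lengths) if b < a)
    kept = sum(1 for b in trimmed_lengths if b >= minlen)
    bases_in = sum(original_lengths)
    bases_out = sum(b for b in trimmed_lengths if b >= minlen)
    return {
        "qtrimmed_read_pct": 100 * trimmed / n,
        "result_read_pct": 100 * kept / n,
        "qtrimmed_base_pct": 100 * (bases_in - bases_out) / bases_in,
        "result_base_pct": 100 * bases_out / bases_in,
    }

# Four 100 bp reads; three get trimmed, one of them below minlen=70 (dropped).
stats = summarize([100, 100, 100, 100], [95, 80, 50, 100], minlen=70)
print(stats)  # read %s: 75 + 75 > 100; base %s: 31.25 + 68.75 = 100
```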
Old 11-12-2015, 05:51 AM   #8
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182

Quote:
Originally Posted by GenoMax View Post
Here is what @Brian had to say about this question


I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.
Hi Genomax,
Sorry, I don't understand what that means.
Let me take baby steps here. Input: R2 (the reverse read) only.

Quote:

Input: 2301451 reads 195623335 bases.
QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
Result: 2043361 reads (88.79%) 172932784 bases (88.40%)
QTrimmed (reads) % + Result (reads) % > 100

However,
QTrimmed (bases) % + Result (bases) % = 100

I'm confused here.
Were 16.11% of my reads discarded?
If so, my result should be about 84%. Why is there a conflict?

References:
http://seqanswers.com/forums/showthr...ighlight=bbduk

http://seqanswers.com/forums/showthread.php?t=58221
Old 11-12-2015, 06:45 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,815

Quote:
Originally Posted by bio_informatics View Post
Hi Genomax,
Qtrimmed (reads) % + Result (reads) % >100

However,
Qtrimmed (bases) % + Result (bases) % =100

I'm confused here.
Were 16.11% of my reads discarded?
If so, then my output should have 83-ish% of result. Why there's a conflict?
~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.
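Plugging in the numbers from the post above makes this concrete (assuming every discarded read was dropped because trimming took it below minlen):

```python
# Numbers from the BBDuk summary quoted above (R2 only, trimq=25, minlen=70).
input_reads  = 2_301_451
qtrimmed     = 370_763    # reads that were touched by quality trimming
result_reads = 2_043_361  # reads surviving in the output

discarded = input_reads - result_reads   # reads dropped entirely
trimmed_but_kept = qtrimmed - discarded  # counted in BOTH summary lines

print(discarded)          # 258090 reads (~11.2%) actually discarded
print(trimmed_but_kept)   # 112673 reads trimmed yet kept
print(round(100 * result_reads / input_reads, 2))  # 88.79, as in the log
```

So only ~11.2% of reads were discarded, not 16.11%; the other ~4.9% were trimmed but kept, which is why the two read percentages overlap.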

BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?
Old 11-12-2015, 07:35 AM   #10
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182

Quote:
Originally Posted by GenoMax View Post
~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.

BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?
Hi GM,
Thanks for your reply.
That helps.

But then why doesn't this apply to the bases? The base percentages seem to add up to 100.
Yes, I had posted output for Read 2 only.

I'm definitely going to use R1 for analysis.

If I include both reads, with q25 and length 70, the output looks like:

Quote:
Processing time: 4.471 seconds.

Input: 2301450 reads 195623250 bases.
QTrimmed: 459677 reads (19.97%) 51656212 bases (26.41%)
Result: 1701048 reads (73.91%) 143967038 bases (73.59%)

Time: 4.911 seconds.
Reads Processed: 2301k 468.65k reads/sec
Bases Processed: 195m 39.84m bases/sec
The reason I was using R2 only is that if Read 2's result percentage (from BBDuk) is below 80%, those isolates will be discarded from the analysis.
Old 11-12-2015, 07:53 AM   #11
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182

Since I'm interested in seeing how many reads/bases are good, why would I trim them?
Can I use the maq (minimum average quality) argument in BBDuk? My goal is to count the reads that have at least some given quality.
Sorry, I'm muddying the waters. I think it's important for me to understand what to use, when, and why.
Old 11-12-2015, 08:20 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,815

This is a re-sequencing project correct? Minq=25 seems pretty strict unless you have a specific reason. Do you have situations where there are areas in the middle of the read that are bad quality (based on your FastQC scans)?

It is not logical to calculate "average" quality for a read. Is that what you are referring to?

Looks like most of your R2 is ok (though slightly short since the run stopped after 94 cycles?)
Old 11-12-2015, 08:31 AM   #13
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182

Quote:
Originally Posted by GenoMax View Post
This is a re-sequencing project correct? Minq=25 seems pretty strict unless you have a specific reason. Do you have situations where there are areas in the middle of the read that are bad quality (based on your FastQC scans)?

It is not logical to calculate "average" quality for a read. Is that what you are referring to?

Looks like most of your R2 is ok (though slightly short since the run stopped after 94 cycles?)
Hi,

- Yes, average read quality is what I was referring to.
- Yes, the run stopped after 96 cycles.
- I pulled the QC for Read 2; it looks better than I had expected. Nothing below quality 36, and there are no regions where quality slacks in the middle.

Yes, I'm interested in finding out which isolates are ugly so they can be sent for re-sequencing.

There was no particular reason for selecting Q25.
Q1) Is quality 25 too high when working with Read 2 alone? Or
too high when working with both R1 and R2?

That is the issue: should I lower the bar for the result percentage, i.e., keep isolates whose reads passed Q25 and length 70? I'm considering keeping isolates that give at least 80% in the result (this is while running Read 2 only).

Or

lower the quality threshold at which I'm obtaining my result, which in this case is 25.

Thanks again for your support and for guiding me through these discussions.

Last edited by bio_informatics; 11-12-2015 at 08:43 AM.
Old 11-12-2015, 08:46 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,815

Based on what you are describing, this run seems fully acceptable. It is just 6 (or 4) bp shorter on R2 than you expected, which is no big deal (unless there are NNNNN's at the end of every read in R2, where the cycles failed).

I would just scan/trim for the presence of adapters (and, if you must, trim <Q20 bases), and then move on to mapping.
Old 11-12-2015, 10:55 AM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

For trimming, Q25 is very high. The more bases you trim, the less accurate your mapping. Even low-quality bases (under Q10) can increase the specificity when mapping. I don't trim above ~Q15 except in very rare circumstances; Q12 is more typical.

To clarify a couple of things:

"maq", or minimum average quality, translates the quality scores to probabilities, averages them, then transforms back to a PHRED-scaled Q-score. So, maq=20 would remove all reads with an average error rate of over 1%, or an average of 1 error per 100bp read. This is applied after trimming, if trimming is performed. Ns are considered to have a 75% error rate, so it's useful to at least trim trailing Ns (trimq=1) if you use maq; otherwise, any read with 4 trailing Ns would be discarded. And its mate would also be discarded.
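That PHRED round-trip is easy to reproduce. Here is a small sketch of the calculation Brian describes (my own reimplementation for illustration, not BBDuk's source):

```python
import math

# Average quality the "maq" way: convert each PHRED score to an error
# probability, average the probabilities, then convert back to PHRED.
def phred_to_p(q):
    return 10 ** (-q / 10)

def average_quality(quals):
    mean_p = sum(phred_to_p(q) for q in quals) / len(quals)
    return -10 * math.log10(mean_p)

# A read that is Q30 for 96 bases with a Q2 tail of 4 bases averages well
# below 30, because the probability average is dominated by the worst bases.
print(round(average_quality([30] * 96 + [2] * 4), 2))  # ~15.8
```

This is also why trimming trailing Ns first matters: with Ns scored as a 75% error rate, a few of them can drag the whole read (and its mate) below the maq cutoff.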

And - it's important to always trim (and otherwise process) R1 and R2 together, at the same time, in any process that might discard or reorder the reads. Otherwise you will end up with unmatched pairs that won't map correctly.

Lastly, note that you can trim the last 4 bases of read 2 only, and keep the result in the same order as before, like this:

bbduk.sh in=read2.fq out=trimmed2.fq ftr2=4 ordered

In this situation, it would be helpful if you could post a couple charts from one of the bad batches of raw data:

Either post the quality histogram according to Illumina:
bbduk.sh in=read1.fq in2=read2.fq qhist=qhist.txt bhist=bhist.txt

Or, even better, post the quality histograms as determined by mapping:
bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fasta qhist=qhist.txt bhist=bhist.txt mhist=mhist.txt