SEQanswers




Old 11-08-2010, 01:16 PM   #1
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82
Accepted practices of NGS quality filtering?

Hi all,

There is a lot of software for performing various forms of quality control and filtering on short-read data generated by NGS platforms, but it's harder to find information about the decisions one should make when filtering - where to put your thresholds, how much to trim, and so on.

I have just begun to map >100 bp Illumina reads to a small genome, but I seem to have a lot of noise in the data set (both adapter contamination and low-quality sequence), resulting in low mapping rates. Quality plummets towards the 3' end, and trimming all reads improves mapping rates, but of course you don't want to trim away good sequence.

I understand BWA has a pretty neat approach to trimming reads individually (the -q flag), but there is the question of what value to set for this parameter.

Other approaches, like discarding whole reads that fail some minimum threshold, counting ambiguous bases ("N"s), or looking for small windows of poor-quality sequence, are also mentioned every now and then and sound good in principle, but here too it is hard to know exactly how aggressively to apply them. Another thought is that if a read is really bad it probably won't align anyway - but it still feels better to identify and exclude such reads beforehand, right? I have also thought about mapping full-length reads first, then trimming the unmapped ones and trying them again. It seems like there are many different approaches one could take, but it's not obvious which would be best.
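For what it's worth, the kinds of filters I mean (N counts, poor-quality windows) are easy to prototype. A toy Python sketch - the function name and all thresholds here are invented for illustration, not recommendations:

```python
def passes_filters(seq, quals, max_n=2, window=10, min_window_q=15):
    """Reject a read if it has too many ambiguous bases or any
    window whose mean quality drops below a cutoff.  All the
    default thresholds are arbitrary illustrations."""
    # Count ambiguous base calls
    if seq.count("N") > max_n:
        return False
    # Slide a fixed-size window along the quality values
    for i in range(max(1, len(quals) - window + 1)):
        win = quals[i:i + window]
        if sum(win) / len(win) < min_window_q:
            return False
    return True
```

Reads failing the check would simply be dropped before mapping.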

Since this is such a common task, it would seem that some kind of standard practice would start to emerge after a while - at least there should be a lot of people with experience of these kinds of decisions. So does anyone have any opinions on this, or possibly links to other resources on the topic? Many thanks in advance.
Old 11-10-2010, 06:07 AM   #2
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78

Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.
Old 11-10-2010, 07:35 AM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by Bruins
Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.
Good luck with that! This field is new enough that I don't think anyone has a definitive answer for what you should filter and at what cutoff. If there is a book on this I'm not sure I'd trust it.

I suppose this boils down to there being two kinds of quality problems: either you're making calls with low confidence, or you're making correct calls of something you don't want (e.g. adapters).

For low-confidence data you would ideally leave everything in place and have your downstream analysis tools take the quality scores into account, so mappers and aligners won't care too much if low-confidence calls mismatch. This lets you retain as much information as possible - which should be a good thing. However, this falls down if the scores assigned to your calls prove not to be accurate - which is probably the case a lot of the time. That will lead you to ignore good-quality data because of the poor data on the end. We've therefore often decided to trim really poor sequence from our data (normally by truncating a whole run at a particular position), since aligners then have less excuse for getting the mapping wrong.

In some applications (SNP calling, bisulphite-seq etc.) you may prefer never to have to deal with low-confidence calls, and so would trim your data at an early stage and live with the reduced coverage, rather than worry about dealing with a large number of low-confidence predictions later on.

For contaminated data you may not have to worry about the contamination - if you're getting some adapter sequence in your data, it probably won't align to your reference and you can just ignore it. However, if you have partial adapter sequence on the end of real data, this could make a mess of alignment, where you could create false overlaps between otherwise unrelated sequences. As read lengths get longer, an increasing percentage of your library may have some adapter on the end of the reads, and it will become more important to remove this to preserve as much data as possible.
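The partial-adapter case is why trimmers search for a prefix of the adapter at the 3' end of the read. A minimal Python sketch of the idea - exact matching only (real trimmers also tolerate mismatches), and the adapter sequence you'd pass in depends entirely on your library prep:

```python
def trim_adapter_3prime(seq, adapter, min_overlap=5):
    """Remove a (possibly partial) adapter from the 3' end of a read
    by finding the longest prefix of the adapter that exactly
    matches a suffix of the read."""
    max_k = min(len(seq), len(adapter))
    # Try the longest possible overlap first, down to min_overlap
    for k in range(max_k, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq  # no adapter found; leave the read alone
```

Requiring a minimum overlap stops the trimmer from chopping reads on short chance matches.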
Old 11-14-2010, 02:17 PM   #4
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82

Thanks for your reply simonandrews, you bring up several good points. I've also been toying with trimming all reads of a run at the same position - however, this feels a little wrong at some level, since you know you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions about quality thresholds.

I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.

It would be interesting if anyone had additional figures on what kind of success rate one can expect.
Old 11-15-2010, 12:16 AM   #5
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by gaffa
Thanks for your reply simonandrews, you bring up several good points. I've also been toying with trimming all reads of a run at the same position - however, this feels a little wrong at some level, since you know you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions about quality thresholds.
My concern with that approach would be that I might be biasing my results. What if AT-rich sequences show poorer quality? Would I introduce a %GC bias by trimming each read individually? If a whole run is becoming poor quality, then trimming the whole thing is effectively the same as doing a shorter run, and I'm happier with that. Also, depending on your downstream analysis, it may be trickier to handle runs with variable read lengths. A lot of the statistics are easier if you can remove read length as a factor you have to consider.

Quote:
Originally Posted by gaffa
I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.
30-40% mapping is absurdly low for most applications. We generally see mapping efficiency over 70% for RNA-Seq and maybe 60% for ChIP-Seq with 40 bp reads (this will be very antibody-dependent). In many cases the reads which don't map (at least in our case) come from things which aren't present in the assemblies (centromeres, telomeres etc.), or from regions duplicated with high identity, which would require much more sequence to map uniquely. We have a repeat-mapping pipeline where we assign reads to repeat classes, and don't care if they map to more than one instance of a class (or even to multiple classes). This lets us look at most of the unmappable data, albeit in a slightly different way to the conventionally mapped regions.
Old 11-15-2010, 05:17 AM   #6
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

1. BWA's quality trimming method is learned from phred.

2. I have seen quite a lot of papers citing Harismendy et al. (2009). This paper was great at the time of submission (end of 2008; the sequencing would have been done even earlier), but it is not representative any more. NGS is a fast-changing field, and many things have happened in the past two years.
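On point 1, my reading of the phred-style trimming that `bwa aln -q` performs (treat the details as an assumption rather than BWA's exact code) is: scan from the 3' end accumulating (threshold - quality), and cut the read where that running sum peaks. A Python sketch:

```python
def phred_trim(quals, q=20):
    """Phred-style 3' quality trimming, in the spirit of `bwa aln -q`:
    keep the read prefix that maximizes the running sum of
    (q - quality) accumulated from the 3' end.  Returns the
    trimmed read length."""
    best_len = len(quals)   # default: keep the whole read
    running, best = 0, 0
    for i in range(len(quals) - 1, -1, -1):
        running += q - quals[i]
        if running > best:  # new best trade-off found
            best = running
            best_len = i
    return best_len
```

For example, a read with qualities [30, 30, 30, 5, 5] and q=20 gets cut down to its first three bases, while a uniformly high-quality read is left untouched.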
Old 11-15-2010, 07:43 AM   #7
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

I totally agree with lh3 on the second point. Also, we have seen that in the majority of cases, letting the aligner and variant caller deal with low quality works fine. If quality really does go downhill, it is usually consistent across all lanes, and using a pre-defined trim of all reads (essentially a shorter run) avoids the bias Simon mentioned.

FastQC is very useful for summarizing all of this information.
Old 11-17-2010, 08:05 AM   #8
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82

Thanks for the replies. I wonder if anyone has opinions on what Q-value cutoffs should be used - I've seen the values 15 and 20 thrown around.
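For context on those two cutoffs: phred scores encode an estimated per-base error probability P = 10^(-Q/10), so Q20 means a 1% chance the call is wrong and Q15 about 3.2%. A one-liner to check:

```python
def phred_to_error_prob(q):
    # Phred scale: Q = -10 * log10(P)  =>  P = 10^(-Q/10)
    return 10 ** (-q / 10)

for q in (15, 20):
    print(q, round(phred_to_error_prob(q), 4))  # 15 -> 0.0316, 20 -> 0.01
```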

One thing I've been thinking about is to map the full reads first, then take the unmapped ones, trim them from the 3' end and try them again. I haven't seen this approach used, but it doesn't seem like it should be that controversial (though of course it would be time-consuming).

Last edited by gaffa; 11-17-2010 at 08:09 AM.