SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
how to best remove low-coverage reads horvathdp Bioinformatics 25 02-10-2015 09:15 AM
Why trim low quality reads tahamasoodi Bioinformatics 12 03-06-2013 04:20 PM
Deleting LOW Quality Reads AdrianP Illumina/Solexa 4 01-29-2013 06:10 AM
dbSNP and low quality reads Philcolson Bioinformatics 0 10-09-2012 05:38 AM
remove illumina low quality reads StephaniePi83 Bioinformatics 1 04-12-2012 03:12 AM

Reply
 
Thread Tools
Old 12-23-2016, 12:44 PM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default Introducing FilterByTile: Remove Low-Quality Reads Without Adding Bias

I'd like to introduce a new member of the BBMap package, FilterByTile. It's intended to increase the quality of libraries without incurring bias, or to help salvage libraries with major positional quality problems (like flow-cell bubbles).


*Overview*

The quality of Illumina reads is dependent on location in a flowcell. Some areas are in poor optical focus, or have weaker circulation, or air bubbles, can have very low-quality reads that nonetheless pass Illumina’s filter criteria. While these reads usually have below-average quality scores, it requires very aggressive quality-filtering to remove all of the reads with positionally-related low quality. Aggressive quality filtering and trimming can, in turn, cause detrimental impacts on analysis because sequence quality is also sequence-dependent; thus, aggressive filtering can incur bias against extreme-GC portions of the genome, or specific motifs. This may yield poor assemblies, incorrect ploidy calls, bad expression quantification, and similar problems.

FilterByTile is designed to filter low-quality reads based on positional information. By removing only a small fraction of reads - those in the lowest-quality areas of the flowcell – the overall quality of the data can be increased without incurring sequence-specific bias. The default settings of FilterByTile typically remove on the order of 2% of the reads, while reducing the overall error rate by substantially more than 2% (on the order of 10%). Essentially, it gets rid of the worst of the worst.

FilterByTile was originally developed after observing spikes in the kmer-uniqueness plot used to calculate library complexity, in what should have been a monotonically-declining exponential decay curve (generated by bbcalcunique.sh); these spikes corresponded to low-quality locations on the flow-cell. Interestingly, the spikes often have a regular period, indicating a structured pattern such as flow-cell edges, tile edges, or a “streak”. The initial goal of FilterByTile was simply to eliminate these spikes to allow better estimation of library size and complexity, but it can be useful for generally improving library quality as well.


*Notes*

How it works:

Illumina read names contain information about each cluster’s lane, tile, and X,Y coordinates. FilterByTile scans all reads in the file and calculates the average quality score for a given position. Additionally, the average kmer-uniqueness rate is calculated by position; for data with sufficient depth, this can be used as a proxy for error-rate, allowing filtering of data with inaccurate quality scores.

To calculate a useful average quality for a position, sufficient reads are needed. So, reads are aggregated by position into rectangular “micro-tiles”; these micro-tiles are iteratively expanded until the average micro-tile contains at least X reads (default 800). Then, the averages are calculated on a per-micro-tile bases, standard deviations are calculated, and for tiles at least Y standard deviations worse than normal, all reads are discarded together. Thus, smaller micro-tiles allow more precise positional filtering, but larger micro-tiles yield more accurate quality-score averages. Arbitrary shapes such as circles outlining bubbles would be optimal, but there are no plans for this.

How and when to use, or not:

FilterByTile is applicable to any Illumina HiSeq, MiSeq, or NextSeq sequence. Howevver, it depends on large volumes of data for statistics; it’s useless to run it on a set of 4000 reads demultiplexed from a much larger run. In that case, it would be better to use the “dump” flag to dump the statistics from all libraries in the run together, then use the “indump” flag to filter the libraries individually. That way, quality statistics gathered from all reads will be applied to each individual library.

The filtering should be beneficial in most cases - particularly when you want to salvage a library that obviously had bubbles or low-flow streaks in the lane, but also for libraries with no dramatic positional quality issues. However, there are some cases – such as complex metagenomes - in which more coverage is strictly beneficial, so throwing away even low-quality reads is a bad idea. In these cases, or any situation where very low coverage is expected, filtering will often lead to inferior results. With high coverage, FilterByTile should be strictly beneficial.

Read names:

FilterByTile depends on read headers to identify flowcell location. It has been validated with HiSeq, MiSeq, and NextSeq data, but different Illumina demultiplexing/base-calling software versions have different naming conventions, so please contact me if you see Illumina names that it can’t parse. Renamed reads (such as those in the SRA) probably won’t work.

Memory:

FilterByTile should not need too much memory, but if it runs out of memory it will generally be due to calculating kmer uniqueness for a large genome. In this case, the “usekmers=f” flag will ignore kmers and just use quality scores; in that case, it won’t run out of memory.


*Usage Examples*

Single-ended or paired/interleaved files:
Code:
filterbytile.sh in=reads.fq.gz out=filtered.fq.gz
Paired reads in twin files:
Code:
filterbytile.sh in1=r1.fq in2=r2.fq out1=filtered1.fq out2=filtered2.fq

Filtering using a statistical profile from multiple libraries:
Code:
cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell
filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell

Filtering aggressively (when you know there’s a serious problem):
Code:
filterbytile.sh in=x.fq out=y.fq ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5

Disabling kmer uniqueness to increase speed and decrease memory usage:
Code:
filterbytile.sh in=x.fq out=y.fq usekmers=f
Brian Bushnell is offline   Reply With Quote
Old 12-23-2016, 01:06 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

@Brian: Can this be easily extended to provide "MarkDuplicates" functionality? Currently only Piacard tools does this.

I am referring to optical duplicates that form due to "pad hopping" in case of patterned HiSeq 4000 flowcells. Since you are using smaller rectangular tiles to look at reads in the neighborhood would it be possible to identify clusters that may be optical dups and mark them?

Last edited by GenoMax; 12-23-2016 at 01:11 PM.
GenoMax is offline   Reply With Quote
Old 12-23-2016, 01:14 PM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by GenoMax View Post
@Brian: Can this be easily extended to provide "MarkDuplicates" functionality? Currently only Picard tools does this.

I am referring to optical duplicates that form due to "pad hopping" in case of patterned HiSeq 4000 flowcells. Since you are using smaller rectangular tiles to look at reads in the neighborhood would it be possible to identify clusters that may be optical dups and mark them?
It sounds like it might (or could) be a straightforward extension, yes. Could require writing temp files if all reads don't fit into memory, though. I'd have to study the optical duplicate problem in detail first because I don't currently know how to identify them confidently.
Brian Bushnell is offline   Reply With Quote
Old 12-23-2016, 01:25 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

If you can look in the positional neighborhood for clusters having identical sequence then those would be it.
GenoMax is offline   Reply With Quote
Old 12-23-2016, 07:53 PM   #5
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 451
Default

GenoMax, my first thought was also that an optical dup filter was needed! I would love to remove optical duplicates as a first step in a pipeline (before mapping). Seems like between clumpify and this filterbytile he almost has it written already. (Sorry Brian, I'm sure it is annoying for us to besiege you with requests and then add how it should be easy to do.)
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
SNPsaurus is offline   Reply With Quote
Old 12-29-2016, 01:08 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I decided it would work best to add deduplication features to Clumpify. All approaches work perfectly for error-free reads, but Clumpify is affected slightly more by errors than mapping-based approaches, for situations when it is desirable to remove "duplicates" with mismatches. Here's a comparison showing Clumpify's paired read deduplication compared to DedupeByMapping on some real HiSeq data, allow reads with various number of mismatches to the reference (DedupeByMapping is considered the gold standard in each case, though that does not necessarily mean it is more correct). Clumpify is run with different settings (C and D have higher removal rates because they use 3 passes; A is at default settings).



I also added in the ability to restrict duplicate removal to only clusters within a specific number pixels of each other on the flowcell, to avoid removal of PCR duplicates or coincidental duplicates due to high coverage. In so doing I noticed some interesting things... firstly, that most optical duplicates are on different tiles (inter-tile duplicates) rather than the same tile, and secondly, that in the data I tested, NextSeq has a WAY higher optical duplicate rate (~1%) than HiSeq 2500/1T (0.05%). The way you can distinguish between an inter-tile optical duplicate and a PCR duplicate is that inter-tile optical duplicates will share an X or Y coordinate (within some number of pixels, typically under 40). Intra-tile optical duplicates will of course share both coordinates as well as the tile number.

I'll release this once I'm done testing.
Attached Images
File Type: png DedupeClumpify.png (39.0 KB, 116 views)
Brian Bushnell is offline   Reply With Quote
Old 12-29-2016, 01:57 PM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

Am I reading the graph above correctly in that you were not able to find true optical dups (perfect matches on the read) in data you tested? These should be present in problematic HiSeq 3K, 4K data.

You may also want to grab some HiSeq 4000 (or HiSeq X) data from SRA to test since we expect this to be a problem there.
GenoMax is offline   Reply With Quote
Old 12-29-2016, 02:46 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Genomax,

There were plenty of identical pairs. Everything is scaled to 100% in that graph, but here is the raw data:

Code:
	A	B	C	D	DBM
0	3868	3868	3868	3868	3870
1	6260	6316	6402	6454	6470
2	6534	6562	6720	6728	6746
3	6590	6628	6794	6806	6826
4	6622	6662	6832	6840	6858
5	6652	6690	6864	6874	6888
56% of the optical duplicates are perfectly identical, and those were found without problems. Only the reads with mismatches to each other pose challenges, but most of those were still found as well. I still consider them optical duplicates even though they are not technically identical. I've been kind of struggling with the definition of "optical duplicates", but I will use this:

Code:
Reads originating from the same fragment, called multiple times despite originating from nearly the same physical flowcell location.
Since they reads are called multiple times, they can have different errors despite being the same physical cluster.

Last edited by Brian Bushnell; 12-29-2016 at 02:48 PM.
Brian Bushnell is offline   Reply With Quote
Old 01-04-2017, 02:09 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

OK, the new version of Clumpify is out, adding the "dedupe" and "optical" flags (as well as a few other related flags), so you can do optical or full deduplication. Also related to FilterByTile, BBDuk now has xmin, ymin, xmax, and ymax flags for large-scale location-based read filtering; essentially, you can eliminate tile-edge effects using a bounding box. For our NextSeq I was able to eliminate tile-edge duplicates with "xmin=1600 xmax=26300". There did not seem to be any on the Y edges. But, the exact values may vary by machine or run.

Last edited by Brian Bushnell; 01-04-2017 at 02:11 PM.
Brian Bushnell is offline   Reply With Quote
Old 01-30-2017, 09:35 AM   #10
mcmc
Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 14
Default

Brian et al., would you recommend filterbytile.sh be done first, before adapter trimming & quality filtering?
Thanks,
MC
mcmc is offline   Reply With Quote
Old 01-30-2017, 10:11 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi MC,

FilterByTile should be run on raw data, before anything else that changes or removes any reads. If you do any quality-related filtering or trimming steps before FilterByTile, you will remove some of the lowest-quality reads, which will disrupt the statistics.
Brian Bushnell is offline   Reply With Quote
Old 02-11-2017, 02:52 PM   #12
mcmc
Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 14
Default

Hello Brian,
I get the message "Warning: Zero reads processed." using indump=dump.flowcell. But it looks like making the dump file worked ok (it says it processed 780m reads). Is it safe to ignore this warning?

Thanks,
mcmc
mcmc is offline   Reply With Quote
Old 02-11-2017, 03:10 PM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yeah, sorry about that, it's a known bug when you are using already-created flowcell files. If you do everything in a single pass it won't print that. Do you mind posting the filtering statistics, platform, and read length, just out of curiosity? The defaults generally remove around 2-5% of the reads in my testing, but I've only tested it on HiSeq 2500 and NextSeq data.
Brian Bushnell is offline   Reply With Quote
Old 02-11-2017, 03:58 PM   #14
mcmc
Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 14
Default

Thanks. This was HiSeq 2500 2x250 RapidRun (with both lanes concatenated, which I assume is ok, since you refer to "flowcell" and not "lane"). I used the entire dataset (15 metagenomes) to calc the stats, then used dump.flowcell to filter each sample. Here is one example:

Code:
Flagged 36407 of 519552 micro-tiles, containing 50578608 reads:
0 exceeded uniqueness thresholds.
30332 exceeded quality thresholds.
34084 exceeded error-free probability thresholds.
0 had too few reads to calculate statistics.

Filtering reads:        988.159 seconds.

Time:                           988.671 seconds.

Reads Processed:      56900k    57.55k reads/sec
Bases Processed:      14225m    14.39m bases/sec

Reads Discarded:       3094k    5.438%
Bases Discarded:        773m    5.438%
I used "lowqualityonly=t usekmers=f"

Last edited by mcmc; 02-11-2017 at 04:01 PM.
mcmc is offline   Reply With Quote
Old 02-11-2017, 05:32 PM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Thanks! Yes, FilterByTile processes lanes independently, so you can use them all at once.
Brian Bushnell is offline   Reply With Quote
Old 10-20-2017, 01:27 AM   #16
santiagorevale
Member
 
Location: UK

Join Date: Dec 2016
Posts: 17
Default

Hi Brian,

I have a couple of questions about this script:

1) does it work by default for patterned flowcells, like the ones for a HiSeq 4000? Or do I need to run it with some specific options, like "xsize" or "ysize"?

2) if I only have access to one lane instead of the whole flowcell, would it also be the way to go to create the "dump" file with the samples in it, and then use this profile to process sample by sample?

I have a special case where Read 1 and Read 2 have different behaviours as well as length patterns (R1=26bp; R2=75bp). In one lane, I already know that the whole TOP surface for R2 (only) failed, having those reads looking like this:

Code:
@K00150:243:HLG7MBBXX:6:1101:26129:1297 2:N:0:NCGCAGAA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###########################################################################
So, out of the 10 Mi reads, half of them looks like this, hence should be discarded. However, I tried running "filterbytile" in five different ways but I'm not getting it to filter out this data. Here is what I've done:

a. Without "dump", using only Read 2, with aggresive filtering

Code:
filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5
Code:
Flagged 3358 of 6903 micro-tiles, containing 4747957 reads:
3346 exceeded uniqueness thresholds.
0 exceeded quality thresholds.
0 exceeded error-free probability thresholds.
81 had too few reads to calculate statistics.

Reads Discarded:           0 	0.000%
Bases Discarded:           0 	0.000%
b. Without "dump", using only Read 2

Code:
filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz
Code:
Flagged 345 of 6903 micro-tiles, containing 144925 reads:
326 exceeded uniqueness thresholds.
0 exceeded quality thresholds.
0 exceeded error-free probability thresholds.
81 had too few reads to calculate statistics.

Reads Discarded:           0 	0.000%
Bases Discarded:           0 	0.000%
c. Without "dump", using both pairs

Code:
filterbytile.sh in1=Sample1.R1.fastq.gz in2=Sample1.R2.fastq.gz out1=filtered.Sample1.R1.fq.gz out2=filtered.Sample1.R2.fq.gz
Code:
Flagged 878 of 12803 micro-tiles, containing 119766 reads:
0 exceeded uniqueness thresholds.
547 exceeded quality thresholds.
856 exceeded error-free probability thresholds.
95 had too few reads to calculate statistics.

Reads Discarded:        119k 	0.612%
Bases Discarded:       6043k 	0.612%
d. With "dump" for the whole lane, using only Read 2

Code:
filterbytile.sh in=all.R2.fq.gz dump=dump.lane.R2
filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz indump=dump.lane.R2
Code:
Flagged 11949 of 350360 micro-tiles, containing 7514071 reads:
11612 exceeded uniqueness thresholds.
0 exceeded quality thresholds.
0 exceeded error-free probability thresholds.
882 had too few reads to calculate statistics.

Reads Discarded:           0 	0.000%
Bases Discarded:           0 	0.000%
e. With "dump" for the whole lane, using both pairs

Code:
filterbytile.sh in1=all.R1.fq.gz in2=all.R2.fq.gz dump=dump.lane
filterbytile.sh in1=Sample1.R1.fastq.gz in2=Sample1.R2.fastq.gz out1=filtered.Sample1.R1.fq.gz out2=filtered.Sample1.R2.fq.gz indump=dump.lane
Code:
Flagged 16907 of 691597 micro-tiles, containing 5698570 reads:
0 exceeded uniqueness thresholds.
10436 exceeded quality thresholds.
16794 exceeded error-free probability thresholds.
1192 had too few reads to calculate statistics.

Reads Discarded:        173k 	0.886%
Bases Discarded:       8742k 	0.886%

Maybe this is not an issue addressed by the script? Are you considering both surfaces as part of the same tile? Or are you treating them differently?

Alternatively, I've tried filtering the data by quality using "bbduk", but from the two set ups, only one of them work and I'm wondering if that might be a bug?

i. Filtering using "maq" only on Read 2 [didn't work, weird behavior]

Code:
bbduk.sh qtrim=f maq=10 in=Sample1.R2.fq out=filtered.Sample1.R2.fq
Code:
Input:                  	9773111 reads 		732983325 bases.
Low quality discards:   	0 reads (0.00%) 	0 bases (0.00%)
Total Removed:          	0 reads (0.00%) 	0 bases (0.00%)
Result:                 	9773111 reads (100.00%) 	732983325 bases (100.00%)
First I thought the filtering was not active because in the documentation says it's applied "after" trimming. However, if I specified a big "maq" number, like 30, it is doing something:

Code:
Input:                  	9773111 reads 		732983325 bases.
Low quality discards:   	2428888 reads (24.85%) 	182166600 bases (24.85%)
Total Removed:          	2428888 reads (24.85%) 	182166600 bases (24.85%)
Result:                 	7344223 reads (75.15%) 	550816725 bases (75.15%)
Is this a bug, then? Could this option be applied when no trimming is defined? That is sometimes a way to go, specially when using alignment software capable of soft-masking ends.

ii. Filtering using "maxns" only on Read 2 [worked as a charm]

Code:
bbduk.sh qtrim=f maxns=10 in=Sample1.R2.fq out=filtered.Sample1.R2.fq
Code:
Input:                  	9773111 reads 		732983325 bases.
Low quality discards:   	5013913 reads (51.30%) 	376043475 bases (51.30%)
Total Removed:          	5013913 reads (51.30%) 	376043475 bases (51.30%)
Result:                 	4759198 reads (48.70%) 	356939850 bases (48.70%)

This is the behaviour I am expecting for both "filterbytile" and "bbduk maq=10".

Please, let me know if I'm doing something wrong or if it's not how you intended it to be.

Thank you very much in advance, and very sorry for the lengthy post.

Cheers,
Santiago
santiagorevale is offline   Reply With Quote
Old 10-20-2017, 03:56 AM   #17
santiagorevale
Member
 
Location: UK

Join Date: Dec 2016
Posts: 17
Default

Sorry if the same post appears many times. I'm having issues with the browser and am not getting feedback after posting. If all the postings appears, just keep the last one. Thanks!
santiagorevale is offline   Reply With Quote
Reply

Tags
bbduk, bbmap, filterbytile, flow cell, quality filtering

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:02 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO