  • How many poor quality bases define a bad read (using fastx or equivalent)?

    I want to know what people who have used fastx or similar have done to define a poor-quality read. What do you call a bad base (I was thinking of calling one bad if it had a score of 20 or below on the phred33 scale), and how many bad bases do you allow per read? I wondered if >=1 bad base per 10 would define a bad read. So if I had a 100 bp read, 10 or more bases with a score of 20 or less would get it rejected. Does this sound right? Too strict? Too relaxed?
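
    For concreteness, I think this criterion would translate into something like the FASTX-Toolkit's fastq_quality_filter, which keeps only reads in which a given percentage of bases meet a quality cutoff (assuming a build that accepts -Q33 for phred+33 input; file names are placeholders):

    fastq_quality_filter -Q33 -q 21 -p 90 -i reads.fq -o filtered.fq

    If I understand the options correctly, -q 21 -p 90 keeps only reads in which at least 90% of bases have a quality of 21 or higher, i.e. it rejects reads where more than 10% of bases are Q20 or below, which is roughly the 1-in-10 rule above.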

    Thanks for any advice

  • #2
    Where exactly are these bad bases located? If you are aligning to a reference you may be able to tolerate "bad" bases but for de novo assemblies you would want to be strict.

    • #3
      Originally posted by GenoMax
      Where exactly are these bad bases located? If you are aligning to a reference you may be able to tolerate "bad" bases but for de novo assemblies you would want to be strict.
      Before mapping they are located throughout the whole read; after mapping with TopHat you see a 3' bias in the poor-quality reads. Normally I would use mapping to filter out bad reads, but the sheer number of them makes me concerned that some low-quality reads have been falsely mapped.

      • #4
        Are the bad quality bases located at constant positions (cycle #) in reads? What fraction of reads are affected? This could be indicative of some sort of a problem with the run itself.

        • #5
          Originally posted by GenoMax
          Are the bad quality bases located at constant positions (cycle #) in reads? What fraction of reads are affected? This could be indicative of some sort of a problem with the run itself.
          I am not sure what you mean. No, the poor bases occur throughout the reads, not at constant positions (see the FastQC output attached). It could have been a poor run, as both rep1 files from this study look like this. I am re-analysing data published by another lab in 2013 to get used to our computational pipeline and to see whether I can reproduce their results.
          Attached Files

          • #6
            Are you sure this is Sanger FASTQ quality encoding? That run looks pretty marginal, with Q-scores all over the place for all cycles. You would definitely want to trim this data before alignment.

            • #7
              Originally posted by GenoMax
              Are you sure this is Sanger FASTQ quality encoding? That run looks pretty marginal, with Q-scores all over the place for all cycles. You would definitely want to trim this data before alignment.
              TopHat said it is phred33, and this is the output from FastQC, so I think the encoding is probably correct. Given the low mapping rate of about 50%, I think the reads really are low quality, so that makes sense to me.

              Looking at this, it suggests the bias is weak and trimming would only have a marginal effect; am I wrong to think that? If so, where would you trim to?

              I have attached the post-mapping quality, where you can see a 3' bias. I was thinking of trimming reads to 80 nt, then mapping and seeing what the mapping rate is, but also of just removing some poor reads as described earlier.
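
              To make the 80 nt idea concrete, I think the fixed-length trim could be done with the FASTX-Toolkit's fastx_trimmer, something like this (again assuming -Q33 is accepted for phred+33 input; file names are placeholders):

              fastx_trimmer -Q33 -l 80 -i reads.fq -o trimmed80.fq

              As I understand it, -l 80 keeps only the first 80 bases of each read, so this is purely positional and ignores quality.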
              Attached Files

              • #8
                It looks like a possibly bimodal distribution where some reads are basically good and others are basically bad. As a result, your data will have a quantitative bias, since quality is somewhat sequence-dependent on Illumina platforms. This bias will be present whether you quality-trim, quality-filter, or map raw reads, but will be minimized by using an aligner that is very robust to errors (e.g. BBMap rather than TopHat). Because the quality issue is present at all positions in entire reads, quality-insensitive trimming to 80 bp may not help much.

                If you do quality-trimming on this data, I'd suggest a tool that implements an optimal trimming algorithm (BBDuk or seqtk) rather than something like fastx, which will give inferior results, and I'd recommend that you trim on both ends. So, for example:

                bbduk.sh in=reads.fq out=trimmed.fq qtrim=rl trimq=15 minlen=50

                That will ensure that the resulting trimmed read has an average quality of at least Q15 (and, in fact, is the longest possible substring such that every terminal substring has an average quality of at least Q15), which is fine for accurate mapping with BBMap. Reads shorter than 50 bp after trimming will be discarded.

                Alternatively, you can trim them on the fly while mapping with BBMap, which will prevent them from being trimmed shorter than a certain length and will just try mapping them anyway. That may minimize the bias incurred by trimming. I can try to give you an optimal command line for BBMap if you want to try it, and if you tell me what kingdom of organism you are mapping to or what its intron lengths are.
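
                Just to illustrate the on-the-fly approach (this is a rough sketch, not a tuned command; genome.fa, reads.fq, and mapped.sam are placeholders, and maxindel=200000 is only an assumption for a genome with long introns, so it would need to match your organism):

                bbmap.sh in=reads.fq out=mapped.sam ref=genome.fa qtrim=rl trimq=15 untrim maxindel=200000

                Here qtrim=rl trimq=15 quality-trims both ends to Q15 before alignment, untrim restores the trimmed bases (soft-clipped) in the output after mapping, and maxindel caps the longest splice/intron the aligner will consider.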
