SEQanswers

Old 03-23-2010, 07:13 AM   #1
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default quality scores, low mapped%?

Hi,

I'm trying to figure out why I am getting such a low percentage of mapped reads (using tophat/bowtie). I'm still experimenting with bowtie parameters, but so far I can't get much above 30%. I'm working with a new genome with plenty of gaps, so that might explain part of it. But I also don't fully understand the quality scores. Do these look funny to any of you? It seems that the higher quality scores are at the ends of these reads (the Bs)? Any thoughts?

@HWI-EAS385_0044:2:1:4:1884#0/1
CAGCTGGNAGGCTCCACGGCGGGCGTGCGCCAAGTGCCGGGGCTGCACAACGGGAGCCAAGCCTTCCTCTTCTCA
+HWI-EAS385_0044:2:1:4:1884#0/1
\\X[\LTDTT_Vb__X_Z_V`XUceZcfcc_PTPKVb__\]bee]]X_BBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS385_0044:2:1:4:1477#0/1
GGGCCATNGCATCTGTGGGCACGGGAGGGGCCAGCACAGCCGCAGGACTACTGGCCGAGGCCCCCGCCGCGGCAG
+HWI-EAS385_0044:2:1:4:1477#0/1
ecdce[bE]`TTTSS\Wb\bTW^XNRMURVO\PX]Q`N^^R]SK\\\MVM\P\V^M[LPX`^BBBBBBBBBBBBB
@HWI-EAS385_0044:2:1:4:849#0/1
GTCGTACTCCTAGGGCTCGTGGTCGGCTGCGCCGGCTTGTCGTTTCGCTTCGCCTGCGGGCTGGGCTCCGTCGTG
+HWI-EAS385_0044:2:1:4:849#0/1
bXb_[`c_cc\U`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Old 03-23-2010, 07:33 AM   #2
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

If you map human reads to human, good aligners can map 95% of them. If you align these reads to chimpanzee, which is 1.2-1.3% different from human, about 90% can be mapped. If you are talking about a 5-10% mismatch rate, most short-read aligners will not work well. Perhaps ssaha2 is less affected. In addition, bowtie does not do gapped alignment. Tuning bowtie's "-e" option may also help. Alternatively, you may consider assembling your reads de novo first and then aligning the contigs.
Old 03-23-2010, 07:35 AM   #3
strob
Member
 
Location: Belgium

Join Date: Nov 2008
Posts: 79
Default

It may be handy to first read this paper in order to know what is what:

Cock et al, 2009. "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Published in NAR.
Old 03-23-2010, 07:47 AM   #4
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

I assume the spaces I see in your sequences are an artifact of copy-paste?
Old 03-23-2010, 07:55 AM   #5
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default spaces

Thanks for the responses.

Yes, the space is a copy-paste thing.

Also, I thought I understood the quality scores and was just checking that my understanding was correct, but I think I've got it now: I had the quality scores backwards. So the Bs are actually quite bad? Much of my data looks like what I posted. Is this about what is expected?
Old 03-23-2010, 08:06 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

Quote:
Originally Posted by Zigster View Post
I assume the spaces I see in your sequences are an artifact of copy-paste?
You can avoid this by putting [ code ] and [ /code ] tags round the example. There is a little icon with a # symbol on it on the edit box to make this easy.
Old 03-23-2010, 08:20 AM   #7
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default

thanks. that is good to know...

any thoughts about the Bs???
Old 03-23-2010, 08:35 AM   #8
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

Quote:
Originally Posted by chrisbala View Post
thanks. that is good to know...

any thoughts about the Bs???
The ASCII code for B is 66, and subtracting the offset of 64 used in Solexa/Illumina FASTQ gives a quality score of 2 (very poor).

At the start of the reads you have characters like X, which is ASCII 88 and thus a score of 24, which is OK.

i.e. the starts of your reads have OK scores, but this rapidly trails off, and the middles and ends of your reads have poor scores.

So yes, you did have the score interpretation backwards in your earlier posts.

[I'm assuming you have Solexa or Illumina style FASTQ files here]
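
To make the arithmetic concrete, here is a minimal Python sketch of the decoding. It assumes, as above, an Illumina-style file with an ASCII offset of 64; the function name decode_quality is just an illustrative choice, and the example string is the start of the quality line of the third read posted earlier in the thread.

```python
def decode_quality(quality_string, offset=64):
    """Turn a FASTQ quality string into a list of integer scores.

    offset=64 is an assumption (Solexa/Illumina 1.3+ style FASTQ);
    Sanger-style FASTQ files would need offset=33 instead.
    """
    return [ord(ch) - offset for ch in quality_string]


# Raw string, because the quality line contains a backslash.
example = r"bXb_[`c_cc\U`BBBB"

for ch, score in zip(example, decode_quality(example)):
    print(ch, ord(ch), score)
```

Each printed row shows the character, its ASCII code, and the resulting score, so the run of Bs shows up as a run of 2s.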
Old 03-23-2010, 08:47 AM   #9
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default uuggh

That is what I feared. And I assume this is, in general, worse than what people usually get in their Illumina data?
Old 03-23-2010, 08:59 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

Having the quality scores drop off with the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Ask your sequencing centre to have a look at it maybe? Perhaps they had a bad run.
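
One way to see the drop-off is to average the scores at each position across many reads. The following is only a rough sketch: it assumes an offset of 64, and the function name mean_quality_by_position is made up for this example.

```python
def mean_quality_by_position(quality_strings, offset=64):
    """Return the mean quality score at each read position.

    quality_strings is any iterable of FASTQ quality lines; reads of
    different lengths are allowed. offset=64 is an assumption
    (Illumina-style FASTQ).
    """
    totals = []  # sum of scores seen at each position
    counts = []  # number of reads that reach each position
    for quality in quality_strings:
        for i, ch in enumerate(quality):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two toy quality strings, not real data.
print(mean_quality_by_position(["XXXB", "XXBB"]))
```

A curve that falls gradually towards the 3' end is the normal pattern; a sharp collapse to 2 early in most reads would be worth raising with the sequencing centre.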
Old 03-24-2010, 08:44 AM   #11
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,146
Default

Quote:
Originally Posted by maubp View Post
Having the quality scores drop off with the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Ask your sequencing centre to have a look at it maybe? Perhaps they had a bad run.
Due to the way the Illumina pipeline (or RTA) sorts the output to FASTQ files, the reads at the beginning of the file always look bad. A FASTQ file for a lane of GAII data will have the reads sorted first by tile # and then by x-coordinate. Thus at the start of the file (or really at the start of every block of reads for each tile) you will have reads from the extreme edge of the tile. Reads at the edge are inherently of poorer quality. You can't make any assessment about the overall quality of the run by looking at a few non-randomly selected reads.

The reads at the top of my FASTQ files always have Q-scores like the ones shown here.
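
If you want to eyeball reads that are not biased towards the tile edges, one option is to draw a uniform random sample from the whole file. Below is a minimal Python sketch using reservoir sampling. It assumes plain four-line FASTQ records with no line wrapping, and the names read_fastq and sample_records are invented for this example rather than taken from any library.

```python
import io
import random


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ handle.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield name, sequence, quality


def sample_records(records, k, seed=None):
    """Return up to k records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for n, record in enumerate(records):
        if n < k:
            sample.append(record)
        else:
            # Keep the new record with probability k / (n + 1).
            j = rng.randint(0, n)
            if j < k:
                sample[j] = record
    return sample


# Toy in-memory file; with real data, pass an open file handle instead.
toy = io.StringIO("@r1\nACGT\n+\nXXBB\n@r2\nGGCC\n+\nXXXX\n@r3\nTTAA\n+\nBBBB\n")
for name, sequence, quality in sample_records(read_fastq(toy), k=2, seed=1):
    print(name, quality)
```

Reservoir sampling reads the file once and keeps only k records in memory, which matters for a full lane of GAII data.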
Old 03-24-2010, 09:22 AM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

Nice tip kmcarr
Old 03-24-2010, 09:42 AM   #13
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default

Yeah, thanks for that. The sequencing group here also pointed that out to me (I should have posted a follow-up). So now I am doing some real QC... (but I still think the data quality might be a bit low)