**lh3** · 03-23-2010, 07:33 AM

If you map human reads to the human genome, good aligners can map 95% of them. If you align these reads to chimpanzee, which is 1.2-1.3% different from human, about 90% can be mapped. If you are talking about a 5-10% mismatch rate, most short-read aligners would not work well. Perhaps ssaha2 is less affected. In addition, bowtie does not do gapped alignment. Tuning bowtie's "-e" option may also help. Alternatively, you may consider de novo assembling your reads first and then aligning the contigs.

**strob** · 03-23-2010, 07:35 AM

It may be handy to first read this paper in order to know what is what:

Cock et al. (2009), "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants", published in NAR.

**Zigster** · 03-23-2010, 07:47 AM

i assume the spaces i see in your sequences are an artifact of copy-paste?

**chrisbala** · 03-23-2010, 07:55 AM

spaces

thanks for the responses

yes the space is a copy-paste thing

also, i thought I did understand the quality scores ... just checking that I am correct in my understanding ... but I think I got it .. I had the quality scores backwards ... so Bs are actually quite bad reads? Much of my data looks like what I posted. Is this about what is expected?

**maubp** · 03-23-2010, 08:06 AM

Originally posted by Zigster:

i assume the spaces i see in your sequences are an artifact of copy-paste?

You can avoid this by putting [ code ] and [ /code ] tags around the example. There is a little icon with a # symbol on it in the edit box to make this easy.

**chrisbala** · 03-23-2010, 08:20 AM

thanks. that is good to know...

any thoughts about the Bs???

**maubp** · 03-23-2010, 08:35 AM

Originally posted by chrisbala:

thanks. that is good to know...

any thoughts about the Bs???

The ASCII code for B is 66, which with an offset of 64 (as used in Solexa/Illumina FASTQ) gives a quality score of 2 (very poor).

At the start of the reads you have things like X, which is ASCII 88, thus a score of 24, which is OK.

i.e. the starts of your reads have OK scores, but this rapidly trails off, and the middles and ends of your reads have poor scores.

So yes, you did have the score interpretation backwards in your earlier posts.

[I'm assuming you have Solexa or Illumina style FASTQ files here]
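The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `decode_quality` is made up for illustration, and the default offset of 64 assumes Solexa/Illumina style FASTQ as in the post (Sanger style files use an offset of 33 instead). For old Solexa files the number obtained this way is a Solexa score rather than a PHRED score.

```python
def decode_quality(qual, offset=64):
    """Convert a FASTQ quality string into a list of integer scores.

    offset=64 assumes Solexa/Illumina style encoding (an assumption,
    matching the post above); use offset=33 for Sanger style files.
    """
    return [ord(ch) - offset for ch in qual]


# 'X' is ASCII 88 and 'B' is ASCII 66, so with offset 64:
print(decode_quality("XB"))  # [24, 2]
```

If the same string were wrongly decoded with the Sanger offset of 33, every score would come out 31 higher, which is one way to end up misjudging the data.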

**chrisbala** · 03-23-2010, 08:47 AM

uuggh

that is what i feared. and I assume this is, in general, worse than what people usually get in their Illumina data?

**maubp** · 03-23-2010, 08:59 AM

Having the quality scores drop off with the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Ask your sequencing centre to have a look at it maybe? Perhaps they had a bad run.

**kmcarr** · 03-24-2010, 08:44 AM

Originally posted by maubp:

Having the quality scores drop off with the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Ask your sequencing centre to have a look at it maybe? Perhaps they had a bad run.

Due to the way the Illumina pipeline (or RTA) sorts the output into FASTQ files, the reads at the beginning of the file always look bad. A FASTQ file for a lane of GAII data will have the reads sorted first by tile # and then by x-coordinate. Thus at the start of the file (or really at the start of every block of reads for each tile) you will have reads from the extreme edge of the tile. Reads at the edge are inherently of poorer quality. You can't make any assessment of the overall quality of the run by looking at a few, non-randomly selected reads.

The reads at the top of my FASTQ files always have Q-scores like the ones shown here.
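One way to avoid judging a run by the first few reads is to draw a random sample from the whole file and average the scores per position. The following is only a rough sketch under stated assumptions: 4-line FASTQ records (no line wrapping), an offset of 64, and helper names (`read_fastq`, `reservoir_sample`, `mean_quality_per_position`) that are invented here for illustration and are not part of any Illumina tool.

```python
import io
import random
from itertools import islice


def read_fastq(handle):
    """Yield (name, seq, qual) tuples, assuming 4 lines per record."""
    while True:
        block = list(islice(handle, 4))
        if len(block) < 4:
            return
        name, seq, _, qual = (line.rstrip("\n") for line in block)
        yield name, seq, qual


def reservoir_sample(records, k, seed=0):
    """Pick k records uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def mean_quality_per_position(quals, offset=64):
    """Average score at each read position (offset 64 is an assumption)."""
    totals, counts = [], []
    for qual in quals:
        for pos, ch in enumerate(qual):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Tiny in-memory example; for real data, pass an open file handle instead.
demo = io.StringIO("@r1\nACGT\n+\nXXBB\n@r2\nACGT\n+\nXXXB\n")
picked = reservoir_sample(read_fastq(demo), k=2)
means = mean_quality_per_position(q for _, _, q in picked)
print(means)
```

Because the sample is drawn across all tiles and x-coordinates, the per-position means are not dominated by the edge-of-tile reads that sit at the top of the file.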

**maubp** · 03-24-2010, 09:22 AM

Nice tip kmcarr

**chrisbala** · 03-24-2010, 09:42 AM

yeah, thanks for that. the sequencing group here also pointed that out to me (I should have posted a followup). so now i am doing some real QC.... (but i still think the data quality might be a bit low)


quality scores, low mapped%?
