MAQ - colorspace alignment troubles

Jonathan


01-20-2010, 08:07 AM

Hi all,
before moving on to BFAST and Bowtie for my colorspace alignment problem,
I thought I'd ask around here, because this looks too much like a bug to have been missed by many people.

So I have this nice little colorspace dataset (70 million 50 nt reads, single-end)
and feed it to maq; the reference is a heavily reduced hg18 set.

Steps 1-5 (from http://maq.sourceforge.net/color.shtml ) work fine, step 6 (the mapping) does too,
and an intermediate maq merge is fine as well.

Step 7 has a nifty little requirement that had me debugging MAQ for a day

Code:

maq csmap2nt aln.nt.map ref.bfa aln.cs.map

(It uses a hash keyed on the sequence name, so when several reference sequences share an identical FASTA tag, most matches are discarded.)
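A quick way to check a reference for this before running step 7 is to look for repeated FASTA names. The following is only a minimal sketch: the helper name `duplicate_fasta_names` and the toy reference are made up for illustration, and it assumes the name is the first whitespace-delimited token of each header line, which may not be exactly how maq derives it.

```python
import io
from collections import Counter


def duplicate_fasta_names(handle):
    """Return {name: count} for sequence names that occur more than once."""
    names = Counter()
    for line in handle:
        if line.startswith(">"):
            # Assumption: the name is the first token after '>'.
            fields = line[1:].split()
            names[fields[0] if fields else ""] += 1
    return {name: n for name, n in names.items() if n > 1}


# Toy reference, invented for this example.
example = io.StringIO(">geneA\nACGT\n>geneB\nGGCC\n>geneA\nTTAA\n")
print(duplicate_fasta_names(example))  # {'geneA': 2}
```

For a real file, pass an open file handle for the FASTA that ref.bfa was built from; an empty result means no name is repeated.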

On to the usual SNP calling, but lo and behold:
I'm getting tons of SNPs, as seen below in the pileup view:

Code:

entg|EIF1AY:ccds|CCDS14795.1_1  1       A       0       @       
entg|EIF1AY:ccds|CCDS14795.1_1  2       T       0       @
entg|EIF1AY:ccds|CCDS14795.1_1  3       A       0       @ 
entg|EIF1AY:ccds|CCDS14795.1_1  4       G       0       @  
entg|EIF1AY:ccds|CCDS14795.1_1  5       C       1       @a 
entg|EIF1AY:ccds|CCDS14795.1_1  6       A       2       @.,  
entg|EIF1AY:ccds|CCDS14795.1_1  7       A       2       @., 
entg|EIF1AY:ccds|CCDS14795.1_1  8       A       2       @gG
entg|EIF1AY:ccds|CCDS14795.1_1  9       G       3       @.,,   
entg|EIF1AY:ccds|CCDS14795.1_1  10      A       4       @.CCC
entg|EIF1AY:ccds|CCDS14795.1_1  11      C       4       @gGGG 
entg|EIF1AY:ccds|CCDS14795.1_1  12      T       4       @aAAA   
entg|EIF1AY:ccds|CCDS14795.1_1  13      T       7       @cCCCCCC 
entg|EIF1AY:ccds|CCDS14795.1_1  14      G       8       @aAAAAAAA 
entg|EIF1AY:ccds|CCDS14795.1_1  15      G       9       @.,,,,,,,, 
entg|EIF1AY:ccds|CCDS14795.1_1  16      A       9       @.,,,,,,,, 
entg|EIF1AY:ccds|CCDS14795.1_1  17      A       9       @cCCCCCCCC 
entg|EIF1AY:ccds|CCDS14795.1_1  18      C       9       @aAAAAAAAA 
entg|EIF1AY:ccds|CCDS14795.1_1  19      C       10      @.,,,,,,,,,
entg|EIF1AY:ccds|CCDS14795.1_1  20      A       10      @.,,,,,,,,, 
entg|EIF1AY:ccds|CCDS14795.1_1  21      A       10      @cCCCCCCCCC 
entg|EIF1AY:ccds|CCDS14795.1_1  22      C       10      @aAAAAAAAAA  
entg|EIF1AY:ccds|CCDS14795.1_1  23      C       10      @tAAAAAAAAA 
entg|EIF1AY:ccds|CCDS14795.1_1  24      C       11      @g,,,,,,,,,. 
entg|EIF1AY:ccds|CCDS14795.1_1  25      A       12      @.,,,,,,,,,.,   
entg|EIF1AY:ccds|CCDS14795.1_1  26      A       12      @.,,,,,,,,,.,   
entg|EIF1AY:ccds|CCDS14795.1_1  27      A       12      @tTTTTTTTTTtT   
entg|EIF1AY:ccds|CCDS14795.1_1  28      T       13      @cCCCCCCCCC.CC  
entg|EIF1AY:ccds|CCDS14795.1_1  29      G       14      @cCCCCCCCCCcCCt 
entg|EIF1AY:ccds|CCDS14795.1_1  30      T       14      @gGGGGGGGGGgGGg 
entg|EIF1AY:ccds|CCDS14795.1_1  31      C       15      @aAAAAAAAAAaAAaA 
entg|EIF1AY:ccds|CCDS14795.1_1  32      C       16      @.,,,,,,,,,.,,.,,

It goes on like this, with every second or third position showing a nearly 100% pure non-reference base. In other circumstances I'd be tempted to call each of these a SNP.
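To put a number on "nearly 100% pure", the mismatch fraction per position can be computed directly from the pileup lines. This is a rough sketch, assuming the column layout shown above (name, position, reference base, depth, then a base string starting with '@' in which '.' and ',' mean "same as reference"); the function name `mismatch_fraction` is made up here.

```python
def mismatch_fraction(pileup_line):
    """Return (position, depth, fraction of read bases differing from the reference)."""
    fields = pileup_line.split()
    pos, depth = int(fields[1]), int(fields[3])
    # Drop the leading '@'; what remains is one character per read base.
    bases = fields[4][1:] if len(fields) > 4 else ""
    mismatches = sum(1 for b in bases if b not in ".,")
    return pos, depth, (mismatches / depth if depth else 0.0)


# Two lines copied from the pileup above.
print(mismatch_fraction("entg|EIF1AY:ccds|CCDS14795.1_1  13  T  7  @cCCCCCC"))    # (13, 7, 1.0)
print(mismatch_fraction("entg|EIF1AY:ccds|CCDS14795.1_1  15  G  9  @.,,,,,,,,"))  # (15, 9, 0.0)
```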

So I went ahead and picked one of the reads contributing to the above pileup,
extracted it from the csfasta file, decoded it, and manually aligned the two sequence strings
(one supplied by maq mapview, one by the conversion of csfasta to fasta)

Code:

ttggaaccaacccaaatgtccaacaatgatagactggattaagaaaatgcggcacatatacaccatgg
  TGAACCAACCCAAATGTCCAACAATGATAGACTGGATTAAGAAAATGTGAT
   GACACACAACAATCCGACacCATCgTTGgCGCAgtATAggaaatcccgt

Line 2 begins at pileup position 14. Or rather, what we see listed in the pileup is the sequence from line 3, which does *NOT* match at all and generates enormous numbers of SNPs.
The manually converted (csfasta to fasta) sequence, by contrast, matches almost perfectly, just as I'd expect.
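For reference, the manual csfasta-to-fasta decoding mentioned above can be sketched as follows. It uses the standard SOLiD color rule (with A, C, G, T numbered 0 to 3, the next base is the previous base XOR the color). The function name `decode_colorspace` is hypothetical, and the example read is constructed for illustration; it is not taken from the actual csfasta file.

```python
BASES = "ACGT"


def decode_colorspace(read):
    """Decode a primer base followed by colors (e.g. 'T0123') into nucleotides."""
    prev = BASES.index(read[0].upper())
    out = []
    for i, color in enumerate(read[1:]):
        if color not in "0123":
            # An unknown color (e.g. '.') leaves every downstream base undetermined.
            out.append("N" * (len(read) - 1 - i))
            break
        prev ^= int(color)  # XOR rule: same base -> 0, A<->C -> 1, A<->G -> 2, A<->T -> 3
        out.append(BASES[prev])
    return "".join(out)


# Constructed example read; the primer base itself is not part of the output.
print(decode_colorspace("T12010101001003"))  # GAACCAACCCAAAT
```

Note that a single color error changes every base after it, which is why colorspace reads are normally aligned in colorspace and converted to nucleotides only afterwards.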

So does anyone have an idea what's going on? Someone must have seen something like this, since I did nothing more than follow the documented steps, using an unmodified maq-0.7.1, self-compiled on Fedora 12 x86_64.

Ah, by the way: bwa segfaults on the same dataset during the 'samse' step.

Best
-Jonathan


Jonathan

#2

01-20-2010, 11:16 PM

Originally posted by Jonathan

So I went ahead and picked one of the reads contributing to the above pileup,
extracted it from the csfasta file, decoded it, and manually aligned the two sequence strings
(one supplied by maq mapview, one by the conversion of csfasta to fasta)

Code:

ttggaaccaacccaaatgtccaacaatgatagactggattaagaaaatgcggcacatatacaccatgg
  TGAACCAACCCAAATGTCCAACAATGATAGACTGGATTAAGAAAATGTGAT
   GACACACAACAATCCGACacCATCgTTGgCGCAgtATAggaaatcccgt

Line 2 begins at pileup position 14. Or rather, what we see listed in the pileup is the sequence from line 3, which does *NOT* match at all and generates enormous numbers of SNPs.
The manually converted (csfasta to fasta) sequence, by contrast, matches almost perfectly, just as I'd expect.

A note on the above: I just found out that the mismatching line (the 3rd) is the 'double encoded' (DE) colorspace read. This encoding is applied when merging the csfasta and _qv.qual files with the solid2fastq.pl script supplied with MAQ.
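The relationship between the two lines can be reproduced in a few lines of Python. This is a sketch under one assumption: that double encoding simply relabels colors 0, 1, 2, 3 as A, C, G, T. That mapping is the usual convention and is consistent with the sequences quoted above, but it has not been checked against the solid2fastq.pl source here. The function names are made up.

```python
BASES = "ACGT"


def to_colors(nt):
    """Nucleotides -> colors: each color is the XOR of two adjacent base codes."""
    codes = [BASES.index(b) for b in nt.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def double_encode(colors):
    """Assumed double encoding: colors 0,1,2,3 written as the letters A,C,G,T."""
    return "".join(BASES[int(c)] for c in colors)


# Start of line 2 above, from its second base onward.
nt = "GAACCAACCCAAAT"
colors = to_colors(nt)
print(colors)                 # 2010101001003
print(double_encode(colors))  # GACACACAACAAT, the start of line 3
```

So line 3 looks like the color string of line 2 spelled with nucleotide letters, which is why comparing it base by base against the nucleotide reference produces a mismatch at nearly every position where the color is not 0.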

The csmap2nt step *should* do this conversion, but I have yet to understand that code, so I can't yet see why it fails to convert the DE colorspace reads to nucleotide reads.

Best
-Jonathan