MAQ - colorspace alignment troubles

Jonathan

Member

Join Date: Jun 2009
Posts: 36


01-20-2010, 08:07 AM

Hi all,
before moving on to try BFAST and Bowtie for my colorspace alignment problem,
I thought I'd ask around here, because this looks too much like a bug to have been missed by many people.

So I have this nice little colorspace dataset (70 million 50 nt reads, single-end)
and feed it to maq; the reference is a heavily reduced hg18 set.

Steps 1-5 (from http://maq.sourceforge.net/color.shtml ) work fine, and so does step 6, the mapping;
an intermediate maq merge is fine too.

Step 7 has a nifty little requirement that had me debugging MAQ for a day:

Code:

maq csmap2nt aln.nt.map ref.bfa aln.cs.map

(It uses a hash keyed on the sequence name, so when several FASTA entries in the reference share an identical name, it discards most matches. The reference names need to be unique.)
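A quick way to check a reference for this before running csmap2nt is to look for repeated names in the FASTA file. The following is only a minimal Python sketch, not part of MAQ. It assumes the name is the first whitespace-delimited token after '>' (whether MAQ keys on the full header line or only on this token is an assumption), and duplicate_fasta_names is a hypothetical helper name.

```python
from collections import Counter


def duplicate_fasta_names(lines):
    """Return {name: count} for FASTA names that occur more than once.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            counts[name] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Small in-memory example; the entries are made up for illustration.
example = [">geneA exon1", "ACGT", ">geneB", "TTGA", ">geneA exon2", "CCGA"]
print(duplicate_fasta_names(example))
```

For a real file, something like `with open("ref.fa") as fh: print(duplicate_fasta_names(fh))` should do, where ref.fa stands for whatever FASTA file was used to build ref.bfa.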

On to the usual SNP calling, but oh wonder:
I'm getting tons of SNPs. Below is the pileup view:

Code:

entg|EIF1AY:ccds|CCDS14795.1_1  1       A       0       @       
entg|EIF1AY:ccds|CCDS14795.1_1  2       T       0       @
entg|EIF1AY:ccds|CCDS14795.1_1  3       A       0       @ 
entg|EIF1AY:ccds|CCDS14795.1_1  4       G       0       @  
entg|EIF1AY:ccds|CCDS14795.1_1  5       C       1       @a 
entg|EIF1AY:ccds|CCDS14795.1_1  6       A       2       @.,  
entg|EIF1AY:ccds|CCDS14795.1_1  7       A       2       @., 
entg|EIF1AY:ccds|CCDS14795.1_1  8       A       2       @gG
entg|EIF1AY:ccds|CCDS14795.1_1  9       G       3       @.,,   
entg|EIF1AY:ccds|CCDS14795.1_1  10      A       4       @.CCC
entg|EIF1AY:ccds|CCDS14795.1_1  11      C       4       @gGGG 
entg|EIF1AY:ccds|CCDS14795.1_1  12      T       4       @aAAA   
entg|EIF1AY:ccds|CCDS14795.1_1  13      T       7       @cCCCCCC 
entg|EIF1AY:ccds|CCDS14795.1_1  14      G       8       @aAAAAAAA 
entg|EIF1AY:ccds|CCDS14795.1_1  15      G       9       @.,,,,,,,, 
entg|EIF1AY:ccds|CCDS14795.1_1  16      A       9       @.,,,,,,,, 
entg|EIF1AY:ccds|CCDS14795.1_1  17      A       9       @cCCCCCCCC 
entg|EIF1AY:ccds|CCDS14795.1_1  18      C       9       @aAAAAAAAA 
entg|EIF1AY:ccds|CCDS14795.1_1  19      C       10      @.,,,,,,,,,
entg|EIF1AY:ccds|CCDS14795.1_1  20      A       10      @.,,,,,,,,, 
entg|EIF1AY:ccds|CCDS14795.1_1  21      A       10      @cCCCCCCCCC 
entg|EIF1AY:ccds|CCDS14795.1_1  22      C       10      @aAAAAAAAAA  
entg|EIF1AY:ccds|CCDS14795.1_1  23      C       10      @tAAAAAAAAA 
entg|EIF1AY:ccds|CCDS14795.1_1  24      C       11      @g,,,,,,,,,. 
entg|EIF1AY:ccds|CCDS14795.1_1  25      A       12      @.,,,,,,,,,.,   
entg|EIF1AY:ccds|CCDS14795.1_1  26      A       12      @.,,,,,,,,,.,   
entg|EIF1AY:ccds|CCDS14795.1_1  27      A       12      @tTTTTTTTTTtT   
entg|EIF1AY:ccds|CCDS14795.1_1  28      T       13      @cCCCCCCCCC.CC  
entg|EIF1AY:ccds|CCDS14795.1_1  29      G       14      @cCCCCCCCCCcCCt 
entg|EIF1AY:ccds|CCDS14795.1_1  30      T       14      @gGGGGGGGGGgGGg 
entg|EIF1AY:ccds|CCDS14795.1_1  31      C       15      @aAAAAAAAAAaAAaA 
entg|EIF1AY:ccds|CCDS14795.1_1  32      C       16      @.,,,,,,,,,.,,.,,

It goes on like this, with every second or third position showing a close to 100% pure mismatch, something that on other occasions I'd tend to call a SNP.
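To put a number on "every second or third position", the per-position mismatch fraction can be computed straight from the pileup text. This is a rough Python sketch that assumes the five-column layout shown above (name, position, reference base, depth, bases prefixed with '@', where '.' and ',' mean "same as reference"); mismatch_fraction is a hypothetical helper, and other pileup variants with extra columns would need adjusting.

```python
def mismatch_fraction(pileup_line):
    """Fraction of read bases at one pileup position that differ from the reference.

    Assumes five whitespace-separated columns, with the bases column
    starting with '@' and using '.' / ',' for reference matches.
    """
    fields = pileup_line.split()
    bases = fields[4].lstrip("@")
    if not bases:
        return 0.0  # no coverage at this position
    mismatches = sum(1 for b in bases if b not in ".,")
    return mismatches / len(bases)


# Two positions taken from the pileup above.
for line in (
    "entg|EIF1AY:ccds|CCDS14795.1_1  13      T       7       @cCCCCCC",
    "entg|EIF1AY:ccds|CCDS14795.1_1  15      G       9       @.,,,,,,,,",
):
    print(line.split()[1], mismatch_fraction(line))
```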

So I went ahead and picked one of the reads contributing to the above pileup,
extracted it from the csfasta file, decoded it, and manually aligned the two sequence strings
(one supplied by maq mapview, one from the conversion of csfasta to fasta):

Code:

ttggaaccaacccaaatgtccaacaatgatagactggattaagaaaatgcggcacatatacaccatgg
  TGAACCAACCCAAATGTCCAACAATGATAGACTGGATTAAGAAAATGTGAT
   GACACACAACAATCCGACacCATCgTTGgCGCAgtATAggaaatcccgt

Line 2 begins at pileup position 14, but what we actually see listed in the pileup is the sequence from line 3. It totally does *NOT* match, and it generates enormous amounts of SNPs.
Meanwhile the manually csfasta-to-fasta converted sequence (line 2) matches close to perfectly, just what I'd expect.
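For anyone who wants to repeat the manual csfasta-to-fasta conversion, here is a minimal Python sketch of naive colorspace decoding. It is not the script I used, just an illustration under the usual SOLiD convention: the read starts with a primer base followed by colors 0-3, and with A=0, C=1, G=2, T=3 each color is the XOR of two adjacent base codes. The primer base itself is not part of the read. decode_colorspace is a hypothetical name, and anything after an unreadable color ('.') is emitted as N.

```python
_BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_CODE_TO_BASE = "ACGT"


def decode_colorspace(cs_read):
    """Naively decode a csfasta read (primer base + colors) into nucleotides.

    The primer base only seeds the decoding and is not included in the output.
    Once a color is missing ('.'), the rest of the read is returned as N.
    """
    prev = _BASE_TO_CODE[cs_read[0].upper()]
    out = []
    colors = cs_read[1:]
    for i, color in enumerate(colors):
        if color not in "0123":
            out.append("N" * (len(colors) - i))
            break
        prev ^= int(color)  # next base code = previous base code XOR color
        out.append(_CODE_TO_BASE[prev])
    return "".join(out)


print(decode_colorspace("T0123"))  # TGAT
```

Note that a single wrong color shifts every base after it, so this is only good for eyeballing individual reads, not as a replacement for a colorspace-aware aligner.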

So does anyone have an idea what's going on? Someone ought to have seen something similar, since I did nothing more than follow the instructions, using an unmodified maq-0.7.1, self-compiled on Fedora 12 x86_64.

Ah, btw. bwa segfaults on the same dataset when trying to do the 'samse' step.

Best
-Jonathan


Jonathan

#2

01-20-2010, 11:16 PM

Originally posted by Jonathan:

So I went ahead and picked one of the reads contributing to the above pileup,
extracted it from the csfasta file, decoded it, and manually aligned the two sequence strings
(one supplied by maq mapview, one from the conversion of csfasta to fasta):

Code:

ttggaaccaacccaaatgtccaacaatgatagactggattaagaaaatgcggcacatatacaccatgg
  TGAACCAACCCAAATGTCCAACAATGATAGACTGGATTAAGAAAATGTGAT
   GACACACAACAATCCGACacCATCgTTGgCGCAgtATAggaaatcccgt

Line 2 begins at pileup position 14, but what we actually see listed in the pileup is the sequence from line 3. It totally does *NOT* match, and it generates enormous amounts of SNPs.
Meanwhile the manually csfasta-to-fasta converted sequence (line 2) matches close to perfectly, just what I'd expect.

A note on the above: I just found out that the mismatching line (the 3rd) is the 'double-encoded' (DE) colorspace line. This encoding takes place when merging the csfasta and _qv.qual files using the solid2fastq.pl script supplied with MAQ.
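This can be checked by hand on the sequences quoted above. The Python sketch below converts the start of line 2 back to colors and then double-encodes them. It assumes the usual double-encoding table (0→A, 1→C, 2→G, 3→T, '.'→N), which is what solid2fastq.pl appears to apply; nt_to_colors and double_encode are hypothetical helper names, not MAQ functions.

```python
_BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
# Assumed double-encoding table: colors written as pseudo-nucleotides.
_DOUBLE_ENCODE = str.maketrans("0123.", "ACGTN")


def nt_to_colors(seq):
    """Convert a nucleotide string to its color string (one color per adjacent pair)."""
    codes = [_BASE_TO_CODE[b] for b in seq.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def double_encode(colors):
    """Rewrite a color string with the letters A/C/G/T/N instead of 0/1/2/3/."""
    return colors.translate(_DOUBLE_ENCODE)


nt = "TGAACCAACCCAAAT"  # first 15 bases of line 2 in the alignment above
print(double_encode(nt_to_colors(nt)))  # CGACACACAACAAT
```

Dropping the first letter of that output gives GACACACAACAAT, which is exactly how line 3 starts, one position to the right of line 2. So the pileup really is showing double-encoded colors where nucleotides should be.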

The csmap2nt step *should* do some conversion, though I have yet to understand that code, and after that to see why it fails to convert the DE colorspace reads to nt reads.

Best
-Jonathan