  • MAQ - colorspace alignment troubles

    Hi all,
    before moving on to BFAST and Bowtie for my colorspace alignment problem,
    I thought I'd ask around here, because this looks too much like a bug to have been missed by many people.

    So I have this nice little colorspace dataset (70 million 50-nt reads, single-end)
    and feed it to maq; the reference is a heavily reduced hg18 set.

    Steps 1-5 (from http://maq.sourceforge.net/color.shtml ) work fine, as does step 6, the mapping;
    an intermediate maq merge is fine too.

    Step 7 has a nifty little requirement that had me debugging MAQ for a day:
    Code:
    maq csmap2nt aln.nt.map ref.bfa aln.cs.map
    (It uses a hash keyed on the sequence name, so when several FASTA entries share the same tag, most matches are discarded.)
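
    The name collision can be checked before running csmap2nt. A minimal sketch, assuming the reference is a plain FASTA file and that the sequence name is the first whitespace-delimited token of each header line (an assumption about how the name is keyed, not something taken from the MAQ source). The function name `duplicate_fasta_names` is made up for illustration.

```python
from collections import Counter
import io


def duplicate_fasta_names(handle):
    """Return {name: count} for sequence names that occur more than once."""
    counts = Counter()
    for line in handle:
        if line.startswith(">"):
            # Assumption: the name is the first token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            counts[name] += 1
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Tiny in-memory FASTA with one repeated name, for demonstration only.
    demo = io.StringIO(">seqA first copy\nACGT\n>seqB\nGGCC\n>seqA second copy\nTTAA\n")
    print(duplicate_fasta_names(demo))
```

    For a real reference, pass an open file handle instead of the in-memory demo; an empty result means every name is unique.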

    On to the usual SNP calling, but oh wonder:
    I'm getting tons of SNPs, as shown in the pileup view below:
    Code:
    entg|EIF1AY:ccds|CCDS14795.1_1  1       A       0       @       
    entg|EIF1AY:ccds|CCDS14795.1_1  2       T       0       @
    entg|EIF1AY:ccds|CCDS14795.1_1  3       A       0       @ 
    entg|EIF1AY:ccds|CCDS14795.1_1  4       G       0       @  
    entg|EIF1AY:ccds|CCDS14795.1_1  5       C       1       @a 
    entg|EIF1AY:ccds|CCDS14795.1_1  6       A       2       @.,  
    entg|EIF1AY:ccds|CCDS14795.1_1  7       A       2       @., 
    entg|EIF1AY:ccds|CCDS14795.1_1  8       A       2       @gG
    entg|EIF1AY:ccds|CCDS14795.1_1  9       G       3       @.,,   
    entg|EIF1AY:ccds|CCDS14795.1_1  10      A       4       @.CCC
    entg|EIF1AY:ccds|CCDS14795.1_1  11      C       4       @gGGG 
    entg|EIF1AY:ccds|CCDS14795.1_1  12      T       4       @aAAA   
    entg|EIF1AY:ccds|CCDS14795.1_1  13      T       7       @cCCCCCC 
    entg|EIF1AY:ccds|CCDS14795.1_1  14      G       8       @aAAAAAAA 
    entg|EIF1AY:ccds|CCDS14795.1_1  15      G       9       @.,,,,,,,, 
    entg|EIF1AY:ccds|CCDS14795.1_1  16      A       9       @.,,,,,,,, 
    entg|EIF1AY:ccds|CCDS14795.1_1  17      A       9       @cCCCCCCCC 
    entg|EIF1AY:ccds|CCDS14795.1_1  18      C       9       @aAAAAAAAA 
    entg|EIF1AY:ccds|CCDS14795.1_1  19      C       10      @.,,,,,,,,,
    entg|EIF1AY:ccds|CCDS14795.1_1  20      A       10      @.,,,,,,,,, 
    entg|EIF1AY:ccds|CCDS14795.1_1  21      A       10      @cCCCCCCCCC 
    entg|EIF1AY:ccds|CCDS14795.1_1  22      C       10      @aAAAAAAAAA  
    entg|EIF1AY:ccds|CCDS14795.1_1  23      C       10      @tAAAAAAAAA 
    entg|EIF1AY:ccds|CCDS14795.1_1  24      C       11      @g,,,,,,,,,. 
    entg|EIF1AY:ccds|CCDS14795.1_1  25      A       12      @.,,,,,,,,,.,   
    entg|EIF1AY:ccds|CCDS14795.1_1  26      A       12      @.,,,,,,,,,.,   
    entg|EIF1AY:ccds|CCDS14795.1_1  27      A       12      @tTTTTTTTTTtT   
    entg|EIF1AY:ccds|CCDS14795.1_1  28      T       13      @cCCCCCCCCC.CC  
    entg|EIF1AY:ccds|CCDS14795.1_1  29      G       14      @cCCCCCCCCCcCCt 
    entg|EIF1AY:ccds|CCDS14795.1_1  30      T       14      @gGGGGGGGGGgGGg 
    entg|EIF1AY:ccds|CCDS14795.1_1  31      C       15      @aAAAAAAAAAaAAaA 
    entg|EIF1AY:ccds|CCDS14795.1_1  32      C       16      @.,,,,,,,,,.,,.,,
    It goes on like this, with every second or third position showing a nearly 100% pure mismatch, something that on other occasions I'd tend to call a SNP.
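
    To put a number on this pattern, here is a minimal sketch. It assumes each pileup line has five whitespace-separated columns (name, position, reference base, depth, read bases after a leading `@`) and that `,` and `.` stand for read bases agreeing with the reference, as the listing suggests. The function name `mismatch_fraction` is made up for illustration.

```python
def mismatch_fraction(pileup_line):
    """Return (position, covering bases, fraction differing from the reference)."""
    name, pos, ref, depth, bases = pileup_line.split()[:5]
    bases = bases.lstrip("@")  # the base column starts with '@'
    if not bases:
        return int(pos), 0, 0.0
    # Assumption: ',' and '.' are matches, any letter is a mismatch.
    mismatches = sum(1 for b in bases if b not in ",.")
    return int(pos), len(bases), mismatches / len(bases)


if __name__ == "__main__":
    lines = [
        "entg|EIF1AY:ccds|CCDS14795.1_1  13      T       7       @cCCCCCC",
        "entg|EIF1AY:ccds|CCDS14795.1_1  15      G       9       @.,,,,,,,,",
    ]
    for line in lines:
        print(mismatch_fraction(line))
```

    Positions where the fraction is close to 1.0 at every second or third site across a whole reference point to a systematic decoding problem rather than real variation.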

    So I went ahead and picked one of the reads contributing to the above pileup,
    extracted it from the csfasta file, decoded it, and manually aligned the two sequence strings
    (one supplied by maq mapview, one from the csfasta-to-fasta conversion):
    Code:
    ttggaaccaacccaaatgtccaacaatgatagactggattaagaaaatgcggcacatatacaccatgg
      TGAACCAACCCAAATGTCCAACAATGATAGACTGGATTAAGAAAATGTGAT
       GACACACAACAATCCGACacCATCgTTGgCGCAgtATAggaaatcccgt
    Line 2 begins at pileup pos. 14. What the pileup lists, however, is the sequence from line 3, which does *NOT* match at all and generates enormous numbers of SNPs.
    The manually converted csfasta-to-fasta sequence, by contrast, matches almost perfectly, just what I'd expect.
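
    The manual csfasta-to-fasta conversion can be sketched as follows. This is a naive decoder, assuming the standard SOLiD two-base encoding (A=0, C=1, G=2, T=3, with each color being the XOR of two adjacent bases) and a read that starts with a known primer base. It is not the code MAQ uses, and `decode_colorspace` is a hypothetical name.

```python
BASES = "ACGT"


def decode_colorspace(cs_read):
    """Decode a read such as 'T0123' into nucleotides.

    The first character is the primer base; it seeds the decoding
    and is not part of the returned sequence.
    """
    prev = BASES.index(cs_read[0].upper())
    out = []
    for color in cs_read[1:]:
        if color not in "0123":
            raise ValueError(f"cannot decode color {color!r}")
        # Next base code = previous base code XOR color.
        prev ^= int(color)
        out.append(BASES[prev])
    return "".join(out)


if __name__ == "__main__":
    print(decode_colorspace("T0123"))  # prints TGAT
```

    Note that a single wrong color corrupts every base after it, so this kind of direct decoding is only suitable for spot checks like the one above.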

    So does anyone have an idea what's going on? Someone ought to have seen something similar, since I did nothing more than follow the documented steps, using an unmodified maq-0.7.1, self-compiled on Fedora 12 x86_64.

    Ah, btw, bwa segfaults on the same dataset at the 'samse' step.

    Best
    -Jonathan

  • #2
    Originally posted by Jonathan
    So I went ahead and picked one of the reads contributing to the above pileup,
    extracted it from the csfasta file, decoded it, and manually aligned the two sequence strings
    (one supplied by maq mapview, one from the csfasta-to-fasta conversion):
    Code:
    ttggaaccaacccaaatgtccaacaatgatagactggattaagaaaatgcggcacatatacaccatgg
      TGAACCAACCCAAATGTCCAACAATGATAGACTGGATTAAGAAAATGTGAT
       GACACACAACAATCCGACacCATCgTTGgCGCAgtATAggaaatcccgt
    Line 2 begins at pileup pos. 14. What the pileup lists, however, is the sequence from line 3, which does *NOT* match at all and generates enormous numbers of SNPs.
    The manually converted csfasta-to-fasta sequence, by contrast, matches almost perfectly, just what I'd expect.
    A note on the above: I just found out that the mismatching line (the 3rd) is the 'double-encoded' (DE) colorspace line. This encoding takes place when the csfasta and _qv.qual files are merged by the solid2fastq.pl script supplied with MAQ.
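
    To illustrate the double encoding, here is a minimal sketch, assuming DE simply relabels the colors 0, 1, 2, 3 as A, C, G, T (the mapping of `.` to N is an additional assumption). It is not the code from solid2fastq.pl, and all function names are hypothetical.

```python
BASES = "ACGT"
TO_DE = str.maketrans("0123.", "ACGTN")
FROM_DE = str.maketrans("ACGTN", "0123.")


def colors_of(nt):
    """Color string for a nucleotide sequence (one color per adjacent base pair)."""
    codes = [BASES.index(b) for b in nt.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def double_encode(colors):
    """Relabel colors as letters: 0->A, 1->C, 2->G, 3->T."""
    return colors.translate(TO_DE)


def double_decode(de_read):
    """Recover the color string from a double-encoded read."""
    return de_read.upper().translate(FROM_DE)


if __name__ == "__main__":
    # First 15 bases of line 2 in the alignment quoted above.
    print(double_encode(colors_of("TGAACCAACCCAAAT")))  # prints CGACACACAACAAT
```

    Dropping the first letter of that output gives GACACACAACAAT, which is how line 3 starts. So line 3 is the color sequence of the read written in nucleotide letters, not a base sequence, which explains why it cannot match the reference.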

    The csmap2nt step *should* perform this conversion, though I have yet to understand that code, and to see why it fails to convert the DE colorspace reads to nt reads.

    Best
    -Jonathan
