Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Error using HTSeq on Mapsplice alignments

    Hello,

    I am trying to use HTSeq-0.5.3p9 with python2.7 to get the counts from genomic regions defined in a gff file.
    My data are BAM files generated by MapSplice on (human) RNA-seq paired-end reads from Illumina HiSeq.
    I sorted the BAM files by name with "samtools sort -n" and then transformed them into SAM file by selecting properly paired reads and discarding unmapped and 'mate unmapped' reads ("samtools view -F 12 -f 3").

    But applying HTSeq on these data I have an error:
    > python -m HTSeq.scripts.count -i ID -s reverse data.sam regions.gff > test.counts

    Error occured in line 230375 of file data.sam.
    Error: ("Malformed SAM line: MRNM == '*' although flag bit &0x0008 cleared", 'line 230375 of file data.sam')

    Here are the lines 230375 and 230376 of the SAM file:
    id1/1 403 chr14 21679407 255 29S19M * 0 0 seq1 JJJJJJJJJIJJJJJJJGIGGGGIJHIIJBC@FFFFFHHHHHJJJJJJ XF:Z:GTAG, ZF:Z:FUS_86736804_21679425(+-) RG:Z:id1 NM:i:1 XS:A:+
    id1/1 99 chr11 86736776 255 29M19S * 0 0 seq1 JJJJJJJJJIJJJJJJJGIGGGGIJHIIJBC@FFFFFHHHHHJJJJJJ XF:Z:GTAG, ZF:Z:FUS_86736804_21679425(+-) RG:Z:id1 NM:i:1 XS:A:+

    We can see that flag &0x0008 is indeed cleared, i.e the mate is noticed as mapped, but the MRNM (Mate Reference NaMe) is "*". In addition, no mate seems to be available in the SAM file (no "id1/2" is available).
    The flag &0x0008 seems thus to be incorrectly specified by MapSplice, I guess there is an explanation regarding the fact that this is a Fusion alignment (http://www.netlab.uky.edu/p/bioinfo/...lignmentFormat).

    I read here https://stat.ethz.ch/pipermail/bioco...st/040614.html that before this error was only a warning.

    Is there a way I could deal with these MapSplice fusion alignments using HTSeq?

    Best,
    Anne

  • #2
    I really wish the SAM format specification had been written with a bit more care. Then, we would not have to deal with so many variation in interpreting it and in working around its limitation.

    For example, I understand it to say that for paired reads, each alignment consists of two lines, and hence, the number of lines with a given read ID must be even. The MapSplice people have read it differently, and I cannot blame them because this is nowhere said clearly.

    Sorry for the rant. To be more constructive: Yes, it may be necessary to change this from an error to a warning. I'll see when I find the time for that. Could take a while, sorry; I'm a bit overwhelmed with stuff these days.

    By the way, it has always been an error. The warning was in the opposite case, i.e., flag bit set but MRNM != "*".

    Comment


    • #3
      Here's a program called bamfixunmaps.c which ... fixes unmapped entries in paired end bam files. This will force unmapped reads to have both (TID=-1 ) and ("query sequence itself is unmapped bit" set to 1 [ON] ). It will set the mate read correctly.

      This is a hack of lh3's fixmate program in samtools examples.

      This often shuts up "sam validator" programs.

      Feel free to modify to your needs.


      Code:
      #include <stdlib.h>
      #include <string.h>
      #include "bam.h"
      
      /*
      Input is bamfile sorted by name (samtools -n) with duplicate reads removed (i.e. two paired end entries only).  output is fixed bamfile which will need to be sorted by location and indexed. 
      
      HOW TO COMPILE:
      EDIT  THE -I and -L parameters to point to your samtools installation for headers and libraries :
      gcc -Wall -O2 bamfixunmaps.c -o bamfixunmaps -I /h1/finneyr/samtools-0.1.18/  -L /h1/finneyr/samtools-0.1.18/ -lbam -lz
      */
      
      
      void bam_mating_core(bamFile in, bamFile out)
      {
          bam_header_t *header;
          bam1_t *b[2];
          int curr, has_prev;
      
          header = bam_header_read(in);
          bam_header_write(out, header);
      
          b[0] = bam_init1();
          b[1] = bam_init1();
          curr = 0; has_prev = 0;
          while (bam_read1(in, b[curr]) >= 0)
          {
              bam1_t *cur = b[curr], *pre = b[1-curr];
              if ((cur->core.flag&(BAM_FUNMAP)) || (cur->core.tid < 0))   // flag says "this is unmapped" OR pos is '*' so make consistent
              {
                   cur->core.qual = 0;
                   cur->core.tid  = -1;
                   cur->core.pos  = -1; // prints as 0 in sam
                   cur->core.flag |= BAM_FUNMAP;  // turn on unmapped bit
              }
              if (has_prev) {
                      if (strcmp(bam1_qname(cur), bam1_qname(pre)) == 0) { // identical pair name
                              cur->core.mtid = pre->core.tid; cur->core.mpos = pre->core.pos;
                              pre->core.mtid = cur->core.tid; pre->core.mpos = cur->core.pos;
                      if (pre->core.flag&BAM_FUNMAP) // pre is not mapped
                         cur->core.flag |= BAM_FMUNMAP;  // turn on mate unmapped bit
                      else
                         cur->core.flag &= ~BAM_FMUNMAP; // turn off mate unmapped bit
                      if (cur->core.flag&BAM_FUNMAP)     // cur is not NOT mapped
                         pre->core.flag |= BAM_FMUNMAP;  // turn on unmapped bit
                      else
                         pre->core.flag &= ~BAM_FMUNMAP;  // turn off unmapped bit
                              if (pre->core.tid == cur->core.tid && !(cur->core.flag&(BAM_FUNMAP|BAM_FMUNMAP))
                                      && !(pre->core.flag&(BAM_FUNMAP|BAM_FMUNMAP)))
                              {
                                      uint32_t cur5, pre5;
                                      cur5 = (cur->core.flag&BAM_FREVERSE)? bam_calend(&cur->core, bam1_cigar(cur)) : cur->core.pos;
                                      pre5 = (pre->core.flag&BAM_FREVERSE)? bam_calend(&pre->core, bam1_cigar(pre)) : pre->core.pos;
                                      cur->core.isize = pre5 - cur5; pre->core.isize = cur5 - pre5;
                              } else cur->core.isize = pre->core.isize = 0;
                              if (pre->core.flag&BAM_FREVERSE) cur->core.flag |= BAM_FMREVERSE;
                              else cur->core.flag &= ~BAM_FMREVERSE;
                              if (cur->core.flag&BAM_FREVERSE) pre->core.flag |= BAM_FMREVERSE;
                              else pre->core.flag &= ~BAM_FMREVERSE;
                              if (cur->core.flag & BAM_FUNMAP) { pre->core.flag |= BAM_FMUNMAP; pre->core.flag &= ~BAM_FPROPER_PAIR; }
                              if (pre->core.flag & BAM_FUNMAP) { cur->core.flag |= BAM_FMUNMAP; cur->core.flag &= ~BAM_FPROPER_PAIR; }
                              bam_write1(out, pre);
                              bam_write1(out, cur);
                              has_prev = 0;
                      } else { // unpaired or singleton
                              pre->core.mtid = -1; pre->core.mpos = -1; pre->core.isize = 0;
                              if (pre->core.flag & BAM_FPAIRED) {
                                      pre->core.flag |= BAM_FMUNMAP;
                                      pre->core.flag &= ~BAM_FMREVERSE & ~BAM_FPROPER_PAIR;
                              }
                              bam_write1(out, pre);
                      }
              } else has_prev = 1;
              curr = 1 - curr;
          }
          if (has_prev) bam_write1(out, b[1-curr]);
          bam_header_destroy(header);
          bam_destroy1(b[0]);
          bam_destroy1(b[1]);
      }
      
      int main(int argc, char *argv[])
      {
          bamFile in, out;
          if (argc < 3) {
              fprintf(stderr, "samtools bamfixunmap <in.nameSrt.bam> <out.nameSrt.bam>\n");
              return 1;
          }
          in = (strcmp(argv[1], "-") == 0)? bam_dopen(fileno(stdin), "r") : bam_open(argv[1], "r");
          out = (strcmp(argv[2], "-") == 0)? bam_dopen(fileno(stdout), "w") : bam_open(argv[2], "w");
          bam_mating_core(in, out);
          bam_close(in); bam_close(out);
          return 0;
      }
      Last edited by Richard Finney; 08-22-2012, 07:49 AM.

      Comment


      • #4
        Many thanks for your replies!
        I temporarily discarded these fusion reads and I will have a look at the script to modify their flags.

        After having discarded them, I have HTSeq warnings saying that it does not manage to match read pairs. I am sure that my SAM file is sorted by name (I used samtools sort -n) and that the matching reads appear only once in the SAM file.

        Does someone know another specificity of the MapSplice BAM/SAM format that would explain this problem?

        Examples of warnings for two mates:
        Warning: Read id1:137098JACXX:1:1101:1414:181569/1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
        Warning: Read id1:137098JACXX:1:1101:1414:181569/2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

        SAM file for the two mates (2 consecutive lines in the SAM file):
        id1:137098JACXX:1:1101:1414:181569/1 83 chr20 10594341 63 48M = 10594243 -146 TATAGGATGGGCCCTCTCCGGCGTCAATAGCTTGCTGAATGATATTCC @DC=GFAGGC<(F?6@>F:GF@C:GIIFAHF<BHFFCADDDDD?<@ RG:Z:111122_UNC13-SN749_0137_BD098JACXX_1_GATCAG IH:i:1HI:i:1 NM:i:1
        id1:137098JACXX:1:1101:1414:181569/2 163 chr20 10594243 66 48M = 10594341 146 CCCCAACTCCTTTCTGCTCATTTCTGCTTCTTTGGCCTCTTCCTGAGC CC@FFFFFHHHFHICHI<FHEADFCFDEHEDEDA<CF>A3?DGGGHGE RG:Z:111122_UNC13-SN749_0137_BD098JACXX_1_GATCAG IH:i:1HI:i:1 NM:i:0

        Comment


        • #5
          For HTSeq, the two lines that you have quoted are not describing mate pairs because they do not have the same read name, which, according to the SAM spec, they must have.

          In other words: You have to remove the "/1" and "/2" at the end.

          Comment


          • #6
            Removing the /1 and /2 worked for this problem, thanks! The MapSplice alignments I have seem to contain a lot of incorrect SAM fields...

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Essential Discoveries and Tools in Epitranscriptomics
              by seqadmin




              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
              04-22-2024, 07:01 AM
            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Today, 11:49 AM
            0 responses
            12 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 08:47 AM
            0 responses
            16 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            61 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            60 views
            0 likes
            Last Post seqadmin  
            Working...
            X