Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Asking for suggestiona on samtools bug fixing

    All the four (actually three) feedbacks on the samtools sourceforge page complain that samtools is "very buggy", one complains poor documentation and another one about poor efficiency. I am worried because I really care about correctness, efficiency and friendliness for all the tools I develop, and the feedbacks argue I failed on all. However, I did not get that many bug reports from the mailing list and I am trying to fix every bug I can find in my daily work. Probably there are others I am not aware of. If you know there remain bugs (not feature request or help request), please reply this thread, in case I overlook some important bug report from the mailing list. As for documentations, I think the current FAQ covers 90% questions raised before. Have I overlooked some important ones?

    Known issues:

    1) Samtools does not write header "SO" tag. This is not required by SAM spec and is not a bug.

    2) Sorting does not work. You have zero or negative coordinates in SAM, which is invalid by the spec. Not a bug in samtools, although it would be good to throw an exception.

    3) The windows port sometimes does not work. This is really a bug, but I am expecting most people are using unix.

    By the way, it is recommended to post questions and requests to the samtools mailing list. You will probably get more help there.

    Thank you.
    Last edited by lh3; 01-22-2010, 08:41 AM.

  • #2
    The fact that SAMTools doesn't put in the sorted tag even after sorting does confuse programs downstream (as well as me).

    Otherwise, I think the notes on the sourceforge page are being pretty unfair. I've run samtools either standalone or via the Perl module on a bunch of data (under ubuntu) and had no obvious bugs -- and I'm pretty good at triggering bugs or misinterpreting documentation! I think I couldn't compile it under Cygwin on a PC, but that seems to happen on 50%+ of packages I try (and I have little skill or patience trying to debug such install issues).

    The documentation is perhaps terse, but to some degree that is a matter of taste. Perhaps you could glean some of the most important points you've made here into an FAQ.

    Thanks for all your projects! I'm looking forward to spending the next week or so putting SAMTools and BWA to some intensive testing, so if I think of anything else I'll let you know.

    Comment


    • #3
      Sofar I've run in major trouble running bwa+samtools for ABI-Colorspace data;

      For one, the conversion script solid2fastq.pl has problems converting quality values of -1 to something correct:

      example: qv-line
      Code:
      -1 32 -1 32 -1
      5 qualities, for 5 color calls obviously.
      the following conversion code (taken from the latest bwa-release pl-scripts)
      Code:
      sub read1 {
        my $i = shift(@_);
        my $j = ($i-1)<<1;
        my ($key, $seq);
        my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]);
        while (<$fhs>) {
              my $t = <$fhq>;
              if (/^>(\d+)_(\d+)_(\d+)_[FR]3/) {
                $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines
                die(qq/** unmatched read name: '$_' != '$_'\n/) unless ($_ eq $t);
                my $name = "$pre:$1_$2_$3/$i";
                $_ = substr(<$fhs>, 2);
                tr/0123./ACGTN/;
                my $s = $_;
                $_ = <$fhq>;
                [B]s/^(\d+)\s*//;
                s/(\d+)\s*/chr($1+33)/eg;[/B]
                $seq = qq/\@$name\n$s+\n$_\n/;
                last;
              }
        }
        return defined($seq)? ($key, $seq) : ();
      }
      has issues, as it will return something like
      Code:
      -"A-"A-"
      (A's might be false, I'm not entirely sure what chr(65) is), but, the gist:
      We suddenly have 8 quality values.
      "bwa samse [...]" will definitly step onto your toes there, by segfaulting without giving you a hint about what is going on. Took me one day to resolve.

      Can be fixed by replacing the bold part from above with
      Code:
                s/^(\d+)\s*//;
                s/-1\s*/chr(33)/eg;
                s/(\d+)\s*/chr($1+33)/eg; #does not work for -1 <- those appear when sequence is ./N!

      After having resolved this, the 'samtools view' command blew up on me.

      'samtools view -b -S -o aln.bam aln.sam'
      results in
      Code:
      [samopen] SAM header is present: 161986 sequences.
      Parse error at line 487981: sequence and quality are inconsistent
      Aborted (core dumped)
      What am I doing wrong? I mean, bwa wrote that sam-file, samtools is from the same author, it *should* know what to expect, no?

      I;m currently looking at the above given line:
      Code:
      IIIA_bwa2:2_1189_1158   0       entg|LANCL1:ccds|CCDS2392.1_2   236     37      2M2D44M2S       *       0       0       TACACACAGTACCTTATAGGCCTGGATGAGCATGTAGATTACCCCAGG        ]]]]]]]]]]]]]PO]]]]]]]]OF]]]OU]WYSR]]XDM]]X]**[ XT:A:U  CM:i:2  X0:i:1  X1:i:0  XM:i:2  XO:i:1  XG:i:1  MD:Z:2^CG44
      Any Idea?
      Best
      -Jonathan
      Last edited by Jonathan; 01-24-2010, 07:09 AM.

      Comment


      • #4
        @krobinson

        Thank you for your support. I will consider to update SO tag when I have time.

        @Jonathan

        Someone has reported this rare bug in bwa (not in samtools). Currently I do not work with SOLiD data any more and it is quite difficult for me to fix bugs in the SOLiD module. I will try your example later, though. In addition, it might be true that bwa is not the best aligner for SOLiD data. People are converging to bfast and bioscope. I do not really know their performance. The 1000 genomes project is compiling a benchmark. That will help to tell us which is the best.

        EDIT: Jonathan, as I do not work with SOLiD data routinely, do you happen to have a small example such as I can debug bwa? Thanks.

        Comment


        • #5
          Originally posted by lh3 View Post
          EDIT: Jonathan, as I do not work with SOLiD data routinely, do you happen to have a small example such as I can debug bwa? Thanks.
          I'll try to compile a small set that captures these errors;

          Best
          -Jonathan

          Comment


          • #6
            (as reported on mailing list)

            I've had crashes when dealing with cigar strings that contain indels at the end (eg 1I36M or 36M1I) and I think sometimes there are issues with insertions and deletions neighbouring each other (10M1I1D26M), but I'm less sure on the latter case.

            I use samtools as a C library too and found a few oddities here and there (mostly fixed as I've reported them - thanks Heng). One more recent one, but perhaps expecting too much, is how the pileup interface works with unmapped data. We get the option to control the bit-flag for sequences being masked out, but it explicitly adds BAM_FUNMAP regardless. I suspect there are reasons why this is incompatible with pileup though and my usage is a little, well, odd shall we say!

            Finally (also sent to samtools-dev), could we get a proper make install target for samtools so developers linking against it get a better way of picking up the header files and library, as well as just the binary. Autoconf would help a lot, but I know it's painful to set up.

            James

            Comment


            • #7
              Hi heng

              Though most people are using unix, but for a complete solution, I think it's beter to fix bugs in windows. I have found a bug that the windows version of samtools generate a index file different with unix version. As if it use \r\n as line seperator in windows, but use \r in linux.Use such index file will cast exception in picard.

              Thanks for all your great work.

              Comment


              • #8
                @james

                So far as I know, only bfast generates CIGAR like this and I do not know if it still generates them. Perhaps it also causes problems with IGV/GATK. Nonetheless, it would be good to support this.

                Pileup for filtering out unmapped reads. It is a little dangerous to allow this because to show reads in pileup, you also need CIGAR while few aligners fakes a CIGAR for unmapped reads.

                Someone has sent me an autoconf, but on the first try I failed to make it working. Possibly it is based on an older version of samtools while I was trying to import to the trunk.

                @houhuabin

                It is hard for me to debug any problems in Windows as I do not have a Windows machine. I will look into this when I can find one. At the same time, I guess too few people process data in Windows. It is just not the right OS for command-line tools. Thank you anyway.

                Comment


                • #9
                  Apropos the Windows issues, the broken-bai-index problem was fixed in subversion a couple of months ago. I've analysed the remaining \r\n inconsistencies (the index one was the major one), and have a plan for fixing them.

                  Unfortunately the changes needed to do this properly are annoyingly invasive for something so trivial. It's hard to find the motivation to get this done because of the risk of destabilising the source code and for the reasons Heng mentioned and others. But if people want it I can get it done.

                  Comment


                  • #10
                    Slight correction to the fix....

                    A slight correction to the fix, the conversion from -1 to 0 has to occur BEFORE the first quality value is trimmed... or it will fail to trim if the first quality value is -1.
                    -- Brad

                    New Code:
                    Code:
                              [B]s/-1\s*/chr(33)/eg;
                              s/^(\d+)\s*//;[/B]
                              s/(\d+)\s*/chr($1+33)/eg; #does not work for -1 <- those appear when sequence is ./N!
                    Old Code:
                    Code:
                              [B]s/^(\d+)\s*//;
                              s/-1\s*/chr(33)/eg;[/B]
                              s/(\d+)\s*/chr($1+33)/eg; #does not work for -1 <- those appear when sequence is ./N!

                    Originally posted by Jonathan View Post
                    Sofar I've run in major trouble running bwa+samtools for ABI-Colorspace data;

                    For one, the conversion script solid2fastq.pl has problems converting quality values of -1 to something correct:

                    example: qv-line
                    Code:
                    -1 32 -1 32 -1
                    5 qualities, for 5 color calls obviously.
                    the following conversion code (taken from the latest bwa-release pl-scripts)
                    Code:
                    .
                    .
                    
                              tr/0123./ACGTN/;
                              my $s = $_;
                              $_ = <$fhq>;
                              [B]s/^(\d+)\s*//;
                              s/(\d+)\s*/chr($1+33)/eg;[/B]
                              $seq = qq/\@$name\n$s+\n$_\n/;
                    .
                    .
                    
                      return defined($seq)? ($key, $seq) : ();
                    }
                    has issues, as it will return something like
                    Code:
                    -"A-"A-"
                    (A's might be false, I'm not entirely sure what chr(65) is), but, the gist:
                    We suddenly have 8 quality values.
                    "bwa samse [...]" will definitly step onto your toes there, by segfaulting without giving you a hint about what is going on. Took me one day to resolve.

                    Can be fixed by replacing the bold part from above with
                    Code:
                              s/^(\d+)\s*//;
                              s/-1\s*/chr(33)/eg;
                              s/(\d+)\s*/chr($1+33)/eg; #does not work for -1 <- those appear when sequence is ./N!
                    Any Idea?
                    Best
                    -Jonathan

                    Comment


                    • #11
                      This seems to be a bug, and I didn't find any info about it in the samtools help mailing list...

                      In the 5th column of the pileup file (the list of base reads), it seems that after every indel, an extraneous "." or a "," is inserted. These look like reference reads, but must not be, because if you count each of these as a read, you get a total count much larger than the read count shown in the 4th column.

                      Here's are a couple of examples:
                      chr17 4544978 G 13 .$..+3AGA.,+3aga.+3AGA,+3aga,+3aga,+3aga,+3aga,+3aga,a ##?@G3FG?F?@B
                      chrX 119387833 C 13 ..+3TGA..+3TGA.+3TGA..+3TGA,+3tga,+3tga.+3TGAga^]A 7B-=D#EB1G+#G

                      The software I'm using for indel detection (VarScan) works around this by ignoring any period or comma that follows an indel, but I'm afraid that will sometimes ignore legitimate reference reads...

                      (Aside from this, I've never had any problems with samtools.)
                      Thanks!

                      -Kris

                      Comment


                      • #12
                        this is NOT a bug. In the first example, there are 5 dots, 7 commas and one "a", exactly 13 bases. AGAs are inserted after chr17:4544978. They are not covering that position at all.

                        Comment


                        • #13
                          Ah, I see. Sorry! Thank you very much for the explanation.
                          -Kris

                          Comment


                          • #14
                            Related VarScan Question

                            Originally posted by kweber2 View Post
                            The software I'm using for indel detection (VarScan) works around this by ignoring any period or comma that follows an indel, but I'm afraid that will sometimes ignore legitimate reference reads...
                            Kris -
                            I'm using samtools with VarScan as well and I believe VarScan might have a bug because it ignores the period/comma following an indel. For example:

                            pileup = gi|348|RAIL|132 102 G 6 ...-3AGA.-3AGA.. ~~~~~~

                            VarScan .cns Output = gi|348|RAIL|132 102 G G 4 2 33.33% 1 1 93 93 0.22727272727272704 DEL-3-AGA

                            VarScan counts the 2 deletions between ref base 102 and 103 as 2 of the 6 reads at that base, even though there are actually 6 Gs at 102. Have you run in to anything like this? This seems definitely incorrect to me. I don't mean to hijack this thread (I haven't had a single problem using samtools and think its easy to use and understand), but you're the first person I've seen comment on VarScan in this way...thoughts? Is this a bug/error?
                            Thanks! SH1

                            Comment


                            • #15
                              Yes, I believe this is a bug in VarScan. The comments in the source code say that it is correcting the "bug" in samtools pileup by ignoring periods and commas that follow indels. I will alert the VarScan author about this issue today.

                              Also, if you are using a distributed jar file, make sure you are using the latest one (v2.2.2), because it corrects a bug in the previous version. (It was a bug that caused it to miss certain categories of LOH and indels.)

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X