Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Have you sorted the alignment? Indexing only works for sorted alignment. Also remember to use the latest bwa. The old version may generate some funny alignments, though this happens very rarely.

    Comment


    • #17
      Originally posted by lh3 View Post
      Have you sorted the alignment? Indexing only works for sorted alignment. Also remember to use the latest bwa. The old version may generate some funny alignments, though this happens very rarely.
      I hadn't sorted it before. Now I ran "samtools sort", then "samtools index" on the sorted output. It resulted the same with seg fault.
      I am using bwa version 0.4.5. Is there a newer svn version?
      samtools index works without issue on converted MAQ alignments.

      Comment


      • #18
        I imported an ELAND alignment and was able to convert into SAM, then to BAM, then sort it. However, at the indexing step I too ran into a segmentation fault. I'm using the 0.1.2 version from the download page, not from SVN. Any suggestions?

        Comment


        • #19
          Have you sorted the alignment first? Indexing in 0.1.2 has a bug, but should not cause segfault. Thanks.

          Comment


          • #20
            Yes, I sorted it just fine. In fact the indexing step will complain that the file isn't sorted.

            One issue could be that I just realized that the ref_list file I gave during the import didn't have the reference size in it. I assume this means the length of the reference sequence? I'll have to give that a try when I first import the file (convert to bam).

            Comment


            • #21
              You can generate ref_list file by running "bam faidx" on your reference sequence. The index file can be used with import. Note that faidx allows you to quickly extract subsequence from the genome, which may be useful to you.

              Comment


              • #22
                Thanks for the quick replies. I've tried with various values, etc. and the indexing step still seg faults. Any other ideas on debugging this? Perhaps using an older version? Thanks for any ideas.

                Comment


                • #23
                  I have been looking at the SAM format to see if it is something we should consider for the output of the mapping for a genome assembly we are doing at Complete Genomics. As you may know, we have quite an unusual read structure that may be difficult to represent in SAM (5+10+10+10 times two, mate paired reads). The problem lies in the fact that the there is some overlap between the first 5 base read and the second 10 bases (we call them negative gaps).

                  The read could be

                  acgtc tcgattgcgg ...

                  which maps to the reference like this

                  acgTCgattgcc...

                  The capital TC show the negative gaps where there were actual overlaps in the read sequence.

                  Anyone now how we could represent this in SAM? Can the CIGAR standard deal with negative numbers? 5M-2M8M ?)

                  Should we map the 5 base read and the other 30 bp read as two separate reads? But we also have mate pairs, so how should we represent those? And would any other tool be able to deal with a read structure like this?

                  Thanks,

                  Thon
                  Complete Genomics
                  Thon
                  __________________________________
                  Thon de Boer, Ph.D.
                  Director of Product Management, Software
                  Strand Life Sciences
                  548 Market Street, Suite 82804
                  San Francisco, CA 94104, USA
                  [email protected]
                  www.strandls.com
                  Pioneers in Discovery Research Informatics
                  _______________________________________

                  Comment


                  • #24
                    Negtive length in CIGAR would be good, but that is not supported at the moment. Alternatively, you can save the read as acgtggattgcc and write the CIGAR as 11M. In this way, you cannot get the original read sequence, but I guess this is not so important in most cases. What do you think?

                    Comment


                    • #25
                      Well...The reads are independent and sometimes don't agree and this is something we need to capture so we could not just remove that information.

                      Is there any other way in SAM (now or future) that would allow us to capture our read structure? We are going to be producing thousands of genomes very soon and I'm sure many of our customers would want to use a format that is comparable to the one used in the 1000 genome project, but we also would want our read structure to be supported in that format...

                      Is there a way for us to get involved in the design of the SAM standard?

                      Thon
                      Complete Genomics
                      Thon
                      __________________________________
                      Thon de Boer, Ph.D.
                      Director of Product Management, Software
                      Strand Life Sciences
                      548 Market Street, Suite 82804
                      San Francisco, CA 94104, USA
                      [email protected]
                      www.strandls.com
                      Pioneers in Discovery Research Informatics
                      _______________________________________

                      Comment


                      • #26
                        Hello Thon,

                        You can join samtools' mailing lists which can be found here:



                        You may send around your suggestion and see what others respond. For the moment, I think you may save the read as "acgtggattgcc" and add an optional field to indicate that the first tg is actually an overlap between the first 5bp and the rest of read. In this way, you can use samtools without losing any information. Note that samtools will not look into the optional fields, but the information is kept anyway. Would this be enough for your application?

                        Comment


                        • #27
                          I copied my reply to the samtools-devel mailing list here. I think the following strategy is the best for data generated by Complete Genomics.

                          Now I prefer to store "acgtctcgattgcgg" as:

                          SEQ=acgtctcgattgcgg
                          CIGAR=5M2S8M (or 3M2S10M)

                          where S stands for "soft clipping/skip on the read". I have checked the source code and think the current "samtools pileup" supports internal soft clipping, which means pileup, consensus/indel calling and glf can be applied. Even if samtools does not support internal soft clipping now, I think it should not be hard to change the code.

                          Comment


                          • #28
                            Just wanted to let everyone know that BFAST (http://genome.ucla.edu/bfast) now supports SAM output. Congratulations Heng (I recognize your coding style from reading MAQ's source!), and others for creating a great tool (samtools) and a good start at an open format (SAM).

                            Nils

                            Comment


                            • #29
                              I got seg fault when doing "samtools pileup". The error message is "[bam_pileup_core] the input is not sorted. Abort!", however I do use the sorted input file. Anyone else seen this before?

                              Q

                              Comment


                              • #30
                                You should run "samtools sort" first. Pileup works on a stream without loading the whole alignment into the memory and therefore the alignments must be sorted on the chromosomal positions.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                34 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                72 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                81 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X