  • Error indexing BAM file using samtools

    Hi,

    I downloaded a bam file from NCBI and am unable to index it. Here is what I've done:
    samtools index file.bam

    Error message:
    [bam_header_read] EOF marker is absent.

    I haven't really found anything in thread archives that give this error with a bam file so am pretty sure I'm doing something fundamentally wrong. I'm a samtools and bam file newbie so any help would be much appreciated!

    Thanks!

  • #2
    You're fine, as it is just a warning. Newer versions of samtools and Picard add an EOF marker to the BAM files they write. Earlier BAMs did not have one.

    Nils
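
A minimal sketch of how one might check whether a BAM file carries the EOF marker Nils mentions, assuming the marker is the 28-byte empty BGZF block given in the SAM/BAM specification (the byte values below are reproduced from the spec as best I recall, so verify them against your copy). The function name `has_bgzf_eof` is hypothetical. A missing marker is harmless for older files, but a truncated download would also lack it, so the check can help tell the two cases apart when combined with a file-size or checksum comparison.

```python
import os

# Assumed value: the 28-byte empty BGZF block used as the BAM end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 "
    "02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker block."""
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        return False  # too small to contain the marker at all
    with open(path, "rb") as handle:
        # Jump to the last 28 bytes and compare them with the marker.
        handle.seek(size - len(BGZF_EOF))
        return handle.read() == BGZF_EOF
```

Calling `has_bgzf_eof("file.bam")` returns `False` for a file that triggers the warning above and `True` for one written by a recent samtools.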

    • #3
      Thanks Nils!
      Another newbie question: I'm trying to get a subset of reads from publicly available unmapped data that align to my sequence of interest.

      I'm told that a read's sequence should be available in a BAM file. But isn't a BAM file by definition an alignment file (as in an aligned-to-something file) to begin with? Can I run another alignment program (say BLAST) on a pre-existing BAM file with a completely different query? Very confused and would appreciate any help!

      • #4
        Also, any help on how to run BLAST on a BAM file would be much appreciated!

        • #5
          The SAM format has support for reads that are not aligned. For example, if one end of a paired end read does not map, it can be flagged as unmapped and given the co-ordinate of the other end. I would study the SAM spec carefully. By filtering on the FLAG field, you can pull out reads that are unmapped (assuming that the aligner was kind enough to include unmapped reads).
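
The FLAG filtering described above can be sketched as follows. This is only an illustration that works on SAM text lines (for example the output of `samtools view`), and the function name `unmapped_records` is hypothetical. It relies on the SAM convention that FLAG is the second tab-separated column and that bit 0x4 marks a read as unmapped.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def unmapped_records(sam_lines):
    """Yield the SAM alignment lines whose FLAG has the unmapped bit set."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second column
        if flag & UNMAPPED:
            yield line
```

With samtools itself, `samtools view -f 4 file.bam` applies the same test directly to the BAM file.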

          To run BLAST on a BAM file, you would have to convert the BAM file into whatever format (FASTA?) BLAST requires. This can be done with a quick script or by bugging your local bioinformatician.
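
As a rough idea of what such a quick script might look like, here is a sketch that turns SAM text lines into FASTA records. The names `reverse_complement` and `sam_to_fasta` are hypothetical. It assumes the standard SAM column order (read name first, FLAG second, sequence tenth) and uses FLAG bit 0x10 to restore the original orientation of reads stored on the reverse strand.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def sam_to_fasta(sam_lines):
    """Yield one FASTA record (header line plus sequence line) per SAM read."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if seq == "*":
            continue  # the record does not store a sequence
        if flag & 0x10:
            # Reverse-strand alignments store the reverse complement of the read.
            seq = reverse_complement(seq)
        yield f">{name}\n{seq}\n"
```

Saved as a script that feeds `sys.stdin` to `sam_to_fasta` and writes the results to `sys.stdout`, this could sit at the end of a `samtools view file.bam | ...` pipe.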

          • #6
            Thanks so much again Nils! The scary thought is I'm the "local bioinformatician" and I've googled my fingers silly trying to figure out how to get a FASTA file (that's all I really need!) from the publicly available .bam file. Nobody else around me cares to work with .bam files (yet). Is it best to convert from BAM to SAM and then format the read name and sequence into FASTA? Or is there a better way?

            • #7
              Look at Picard's SamToFastq.jar. That will get you to FASTQ, and from there it is smooth sailing to FASTA. Alternatively, you can use one of the many APIs (Perl, Python, C, Java, etc.) to natively read in SAM/BAM. I have personally used all of them successfully.
              Last edited by nilshomer; 03-03-2010, 08:55 PM. Reason: speak and spell failed
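
The FASTQ-to-FASTA step can be sketched like this. It is a minimal example rather than a robust parser: it assumes plain four-line FASTQ records (header, sequence, `+` line, qualities) with no wrapped sequences, and the function name `fastq_to_fasta` is hypothetical.

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA records from four-line FASTQ records, dropping qualities."""
    # Ignore blank lines, then read the input four lines at a time.
    stripped = (line.rstrip("\n") for line in fastq_lines if line.strip())
    for header, seq, plus, _qual in zip(stripped, stripped, stripped, stripped):
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"not a four-line FASTQ record: {header!r}")
        # Replace the leading '@' of the FASTQ header with the FASTA '>'.
        yield f">{header[1:]}\n{seq}\n"
```

An incomplete trailing record (fewer than four lines) is silently dropped by this sketch, so it is worth checking the record count on real data.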

              • #8
                Thanks Nils, I'll give it a try!

                • #9
                  Using Picard's tool is probably better, but it's worth studying the line below as an example of a very quick-and-dirty SAM-to-FASTA generator:

                  Code:
                  samtools view myalign.bam | perl -n -e 'unless (/^\@/) { @f=split(/\t/); print ">$f[0]|$f[1] $f[2]:$f[3]\n$f[9]\n"; }'

                  I used the flag field to disambiguate the two ends of a read

                  (any bugs were clearly deliberate attempts to educate the student! :-)
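
For a more readable label than the raw flag value, the two ends of a pair can be told apart by FLAG bits 0x40 (first in pair) and 0x80 (second in pair). The sketch below is one way to do that. The helper names `mate_label` and `fasta_name` and the `/1` and `/2` suffix convention are illustrative choices, not something the one-liner above produces.

```python
FIRST_IN_PAIR = 0x40   # SAM FLAG bit: first read of a pair
SECOND_IN_PAIR = 0x80  # SAM FLAG bit: second read of a pair


def mate_label(flag):
    """Return '/1' or '/2' for the two ends of a pair, or '' when unknown."""
    is_first = bool(flag & FIRST_IN_PAIR)
    is_second = bool(flag & SECOND_IN_PAIR)
    if is_first and not is_second:
        return "/1"
    if is_second and not is_first:
        return "/2"
    return ""  # single-end read, or both/neither bit set


def fasta_name(sam_line):
    """Build a FASTA read name from a SAM line, suffixed with its mate label."""
    fields = sam_line.rstrip("\n").split("\t")
    return fields[0] + mate_label(int(fields[1]))
```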

                  • #10
                    That's what I get for not reading the manual well enough. Thanks krobison! And disclaimer duly noted!
                    Last edited by veena; 06-03-2010, 12:38 PM. Reason: being an idiot!
