  • Error converting SAM files to BAM files

    Hi all,

    When I use samtools to convert a SAM file to a BAM file, I run into the following problem:
    samtools view -h -F 4 -q 1 -bS C.filsa.sam >C.filsa.bam
    [samopen] SAM header is present: 7 sequences.
    [sam_read1] reference 'SR' is recognized as '*'.
    [main_samview] truncated file.

    I also get "missing colon in auxiliary data" and "CIGAR and sequence length are inconsistent" for individual rows. My SAM files are the output of gsnap. I am not sure whether these problems are caused by gsnap or by samtools. How can I deal with them?

    Any suggestions and answers are appreciated. Thank you.

  • #2
    Here is a sample of my SAM file. I don't understand where the reference 'SR' comes from.
    SRR019035.130 16 Chr5 9804788 40 36M * 0 0 CAGCCTCAAACGGCGCCGTCTTATACGGTGAGTTAC IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0
    SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.131 16 Chr1 753661 40 30M * 0 0 TGAAGATATTGAACCTCTCCGTTAGGGAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:30 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40
    X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.132 16 Chr3 7844307 40 36M * 0 0 ATGCTGGTAATTCACGAGCTTGATGAAACATTTCAC I3IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0
    SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.133 0 Chr1 28835502 40 36M * 0 0 GTTTTAGTTTCGTCTGCAACTGAGTCATCACCTACT IIIIIIIIIIIIIIIIIIIIIIDIIIIIIDIII-II MD:Z:36 NH:i:1 HI:i:1
    NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.134 0 Chr1 28836313 40 36M * 0 0 GAAAATTTCAGGTCTGGTTCAGAATTGGTTCCGAAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7II MD:Z:36 NH:i:1 HI:i:1
    NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.135 0 Chr5 22542176 40 25M * 0 0 CGTGGTTCTAGGACATCATCTGATA IIIIIIIIIIIIIIIIIIIIIIIII MD:Z:25 NH:i:1 HI:i:1 NM:i:0 SM:i:40
    XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.136 0 ChrC 100327 3 36M * 0 0 GAATAAAGGATTAATCCGTATCATCTTGACTTGGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:2 HI:i:1 NM:i:0
    SM:i:3 XQ:i:40 X2:i:40 XO:Z:UM PG:Z:A
    SRR019035.136 272 ChrC 138287 3 36M * 0 0 AACCAAGTCAAGATGATACGGATTAATCCTTTATTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:2 HI:i:2 NM:i:0
    SM:i:3 XQ:i:40 X2:i:40 XO:Z:UM PG:Z:A
    SRR019035.137 16 Chr1 28835623 40 36M * 0 0 TATTTTCGTCGTCTCTAGAGTTTGAAGCATCAGTCC IIBI61IIIIIHIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1
    NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.138 16 Chr5 19304066 40 36M * 0 0 ATCAATGATATGTTTAAGCAAGACGACTCTTTCAGC IIIII?IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1
    NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.139 0 Chr4 162871 40 26M * 0 0 TGATTTCGTTGTGCTATGTAAACTTT IIIIIIIIIIIIIIIIIIII1IIIII MD:Z:26 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40
    X2:i:0 XO:Z:UU PG:Z:A


    • #3
      The SR... stuff is just the name of the read, which I see you downloaded from SRA (or ENA). Out of curiosity, what happens if you just run:

      Code:
      samtools view -F 0x4 -q 1 -Sbo C.filsa.bam C.filsa.sam
      I wonder if giving the -h option is just screwing things up (it shouldn't do anything when you write a BAM file).
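
For reference, the two filters in that command act on the FLAG and MAPQ columns of each record: `-F 0x4` drops reads with the "unmapped" bit set, and `-q 1` drops reads with a mapping quality below 1. A minimal Python sketch of the same test, assuming tab-separated SAM records (the helper name `passes_filter` is made up for illustration and is not part of samtools):

```python
def passes_filter(sam_line, exclude_flags=0x4, min_mapq=1):
    """Mimic the -F 0x4 and -q 1 filters for one tab-separated SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    # -F drops records with any of the given bits set;
    # -q drops records whose MAPQ is below the threshold.
    return (flag & exclude_flags) == 0 and mapq >= min_mapq
```

With the sample above, the secondary alignment of SRR019035.136 (FLAG 272, MAPQ 3) would pass both filters, while any record with FLAG 4 would be dropped.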


      • #4
        Thanks dpryan.
        I tried your command, but "reference 'SR' is recognized as '*'." still occurred. My SRA data were downloaded from http://www.ncbi.nlm.nih.gov/sra/?term=SRR019035.


        • #5
          If the first 1000 lines or so are sufficient to reproduce this, could you attach that (you have to edit in "advanced" mode and click on the paperclip)? That'd provide a reproducible example. To get the first 1000 (or whatever) lines, just:

          Code:
          head -n 1000 file.sam > excerpt.txt


          • #6
            I tried the first 1000 rows and there was no problem. So I have attached the first 500 rows and the last 500 rows for you, but I am not sure the problem will appear in them.

            Whenever I process large SAM files, only a very few lines have problems such as 'missing colon in auxiliary data' or 'CIGAR and sequence length are inconsistent', and those two errors always report the specific line, so I can find the problem. Only for "reference *** is recognized as '*'" can I not find which lines are affected.
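
One way to locate such lines, as a rough sketch: as far as I can tell, samtools prints that message when the third column (RNAME) of a record is not one of the names declared in the `@SQ` header lines, so scanning for records with an undeclared RNAME should point at the affected rows. The function name `find_unknown_references` is hypothetical, and the code assumes a plain-text, tab-separated SAM file with its header at the top:

```python
def find_unknown_references(lines):
    """Return (line_number, rname) for records whose RNAME is not in an @SQ line."""
    known = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("@"):
            # Collect reference names from header lines such as "@SQ  SN:Chr1  LN:..."
            if line.startswith("@SQ"):
                for tag in line.split("\t")[1:]:
                    if tag.startswith("SN:"):
                        known.add(tag[3:])
            continue
        fields = line.split("\t")
        rname = fields[2] if len(fields) > 2 else ""
        # "*" is the legal placeholder for "no reference" (unmapped reads).
        if rname != "*" and rname not in known:
            problems.append((number, rname))
    return problems
```

Passing an open file handle, for example `find_unknown_references(open("C.filsa.sam"))`, would return the line numbers to inspect. A row that was cut short or broken in two would typically show up here with a fragment such as 'SR' in the third column.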

            My SAM files come from gsnap alignments, so I am unsure whether the problems are caused by gsnap or by samtools. If they are caused by gsnap, 99% of the data is still OK. How can I avoid these problems and filter out the low-quality records in advance?
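
A pre-filter along these lines could set the bad records aside before samtools ever sees them. This is only a sketch under stated assumptions: the names `record_problem` and `split_records` are hypothetical, the checks cover just the field count, the CIGAR/SEQ length agreement and the `TAG:TYPE:VALUE` shape of the optional fields, and it does nothing about whatever produced the bad rows in the first place.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
AUX_TAG = re.compile(r"[A-Za-z][A-Za-z0-9]:[AifZHB]:.*")


def record_problem(line):
    """Return a description of the first problem found in a SAM record, or None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return "fewer than 11 mandatory fields"
    cigar, seq = fields[5], fields[9]
    if cigar != "*" and seq != "*":
        ops = CIGAR_OP.findall(cigar)
        if "".join(n + op for n, op in ops) != cigar:
            return "malformed CIGAR"
        # M, I, S, = and X are the operations that consume bases of the read.
        query_len = sum(int(n) for n, op in ops if op in "MIS=X")
        if query_len != len(seq):
            return "CIGAR and sequence length are inconsistent"
    for aux in fields[11:]:
        if not AUX_TAG.fullmatch(aux):
            return "malformed auxiliary field"
    return None


def split_records(lines):
    """Separate header and clean lines from problem records: (clean, rejected)."""
    clean, rejected = [], []
    for number, line in enumerate(lines, start=1):
        problem = None if line.startswith("@") else record_problem(line)
        if problem is None:
            clean.append(line)
        else:
            rejected.append((number, problem))
    return clean, rejected
```

Writing `clean` to a new file and converting that file instead keeps the good records, while `rejected` lists the line numbers and reasons, which would also be useful material for a bug report to the aligner's developer.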

            • #7
              That doesn't seem to reproduce the problem either. It's very likely that the problem is with gsnap, which apparently is producing corrupt output on occasion. You might consider upgrading if that's an option or report the issue to the developer.


              • #8
                Thank you for your good advice. It really helped me.
