  • sam files convert to bam files error

    Hi all,

    When I use samtools to convert a SAM file to a BAM file, I get the following errors:
    samtools view -h -F 4 -q 1 -bS C.filsa.sam >C.filsa.bam
    [samopen] SAM header is present: 7 sequences.
    [sam_read1] reference 'SR' is recognized as '*'.
    [main_samview] truncated file.

    I also get "missing colon in auxiliary data" and "CIGAR and sequence length are inconsistent" for individual rows. My SAM files are the output of gsnap. I am not sure whether these problems are caused by gsnap or by samtools. How can I deal with them?

    Any suggestions and answers are appreciated. Thank you.

  • #2
    The following is a sample of my SAM file. I don't understand where the reference 'SR' comes from.
    SRR019035.130 16 Chr5 9804788 40 36M * 0 0 CAGCCTCAAACGGCGCCGTCTTATACGGTGAGTTAC IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.131 16 Chr1 753661 40 30M * 0 0 TGAAGATATTGAACCTCTCCGTTAGGGAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:30 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.132 16 Chr3 7844307 40 36M * 0 0 ATGCTGGTAATTCACGAGCTTGATGAAACATTTCAC I3IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.133 0 Chr1 28835502 40 36M * 0 0 GTTTTAGTTTCGTCTGCAACTGAGTCATCACCTACT IIIIIIIIIIIIIIIIIIIIIIDIIIIIIDIII-II MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.134 0 Chr1 28836313 40 36M * 0 0 GAAAATTTCAGGTCTGGTTCAGAATTGGTTCCGAAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7II MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.135 0 Chr5 22542176 40 25M * 0 0 CGTGGTTCTAGGACATCATCTGATA IIIIIIIIIIIIIIIIIIIIIIIII MD:Z:25 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.136 0 ChrC 100327 3 36M * 0 0 GAATAAAGGATTAATCCGTATCATCTTGACTTGGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40 XO:Z:UM PG:Z:A
    SRR019035.136 272 ChrC 138287 3 36M * 0 0 AACCAAGTCAAGATGATACGGATTAATCCTTTATTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:2 HI:i:2 NM:i:0 SM:i:3 XQ:i:40 X2:i:40 XO:Z:UM PG:Z:A
    SRR019035.137 16 Chr1 28835623 40 36M * 0 0 TATTTTCGTCGTCTCTAGAGTTTGAAGCATCAGTCC IIBI61IIIIIHIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.138 16 Chr5 19304066 40 36M * 0 0 ATCAATGATATGTTTAAGCAAGACGACTCTTTCAGC IIIII?IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.139 0 Chr4 162871 40 26M * 0 0 TGATTTCGTTGTGCTATGTAAACTTT IIIIIIIIIIIIIIIIIIII1IIIII MD:Z:26 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A



    • #3
      The SR... stuff is just the name of the read, which I see you downloaded from SRA (or ENA). Out of curiosity, what happens if you just run:

      Code:
      samtools view -F 0x4 -q 1 -Sbo C.filsa.bam C.filsa.sam
      I wonder if giving the -h option is just screwing things up (it shouldn't do anything when you write a BAM file).



      • #4
        Thanks, dpryan.
        I tried your command, but "reference 'SR' is recognized as '*'." still occurred. I downloaded my SRA data from http://www.ncbi.nlm.nih.gov/sra/?term=SRR019035.



        • #5
          If the first 1000 lines or so are sufficient to reproduce this, could you attach that (you have to edit in "advanced" mode and click on the paperclip)? That'd provide a reproducible example. To get the first 1000 (or whatever) lines, just:

          Code:
          head -n 1000 file.sam > excerpt.txt



          • #6
            I tried the first 1000 rows and there was no problem, so I have attached the first 500 rows and the last 500 rows for you, though I am not sure the problem will appear in them.

            Whenever I process large SAM files, only a very few lines have problems such as 'missing colon in auxiliary data' or 'CIGAR and sequence length are inconsistent'. Those two errors always report the specific line, so I can find the problem. Only for "reference *** is recognized as '*'" can I not tell which lines are at fault.

            My SAM files come from gsnap alignments, so I am unsure whether the problems are caused by gsnap or by samtools. If they are caused by gsnap, 99% of the data is still OK. How can I avoid these problems and filter out the low-quality records in advance?


            • #7
              That doesn't seem to reproduce the problem either. It's very likely that the problem is with gsnap, which apparently is producing corrupt output on occasion. You might consider upgrading if that's an option or report the issue to the developer.
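
If upgrading is not an option, one way to filter the bad records in advance is to check each SAM line before handing the file to samtools. The following is only a rough sketch, not something samtools or gsnap provides: the function names (`cigar_query_length`, `check_record`, `filter_sam`) are made up for illustration, and it assumes a tab-separated SAM file whose @SQ header lines list every valid reference name. It covers the three messages mentioned in this thread: a reference name that is not in the header, a CIGAR that disagrees with the sequence length, and an auxiliary field without the TAG:TYPE:VALUE colons.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
AUX_FIELD = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZHB]:.*$")
QUERY_OPS = set("MIS=X")  # CIGAR operations that consume bases of SEQ


def cigar_query_length(cigar):
    """Return the read length implied by a CIGAR string, or None if malformed."""
    ops = CIGAR_OP.findall(cigar)
    # If the matched pieces do not rebuild the whole string, something is off.
    if "".join(n + op for n, op in ops) != cigar:
        return None
    return sum(int(n) for n, op in ops if op in QUERY_OPS)


def check_record(line, known_refs):
    """Return a list of problems found in one alignment line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    problems = []
    rname, cigar, seq = fields[2], fields[5], fields[9]
    if rname != "*" and rname not in known_refs:
        problems.append("RNAME %r not in @SQ header" % rname)
    if cigar != "*" and seq != "*":
        qlen = cigar_query_length(cigar)
        if qlen is None:
            problems.append("malformed CIGAR")
        elif qlen != len(seq):
            problems.append("CIGAR and sequence length differ")
    for aux in fields[11:]:
        if not AUX_FIELD.match(aux):
            problems.append("malformed auxiliary field %r" % aux)
    return problems


def filter_sam(lines, out, log):
    """Copy header and clean records to `out`; report bad records to `log`."""
    known_refs = set()
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("@"):
            if line.startswith("@SQ"):
                # Collect reference names from the SN: tag of each @SQ line.
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SN:"):
                        known_refs.add(field[3:])
            out.write(line)
            continue
        problems = check_record(line, known_refs)
        if problems:
            log.write("line %d: %s\n" % (lineno, "; ".join(problems)))
        else:
            out.write(line)
```

Calling `filter_sam(sys.stdin, sys.stdout, sys.stderr)` from a small script (after `import sys`) would let you pipe the gsnap output through it and send the cleaned SAM on to samtools, while the log lists the line numbers of the dropped records. That should also show which lines trigger the "reference ... is recognized as '*'" message, since those are the ones whose third column is not a name from the header.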



              • #8
                Thank you for your good advice. It really helped me.
