Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help in converting Sam to Bam!

    Hello again everyone...

    So, today I was happy as I got all my fastq files aligned and thought today's a beautiful day and all will go when I want to finally get some analysis out of my sam file (mind you, this is the very first time I am doing this).....so I started (well, attempted to start) converting my sam to bam files using 1) samtools 2) picard SortSam.jar...neither method works and I would really appreciate your input regarding this. I searched this great forum but nobody seemed to have the problem that I am having. Here are the details:


    1- Using samtools.

    command: samtools view -bS $REF mysamfile.sam > bamfile.bam

    output:
    samopen] SAM header is present: 84 sequences.
    [sam_read1] reference 'AS:human_g1k_v37' is recognized as '*'.
    Parse error at line 163: invalid CIGAR character
    Aborted

    So samtools view basically halted when it encountered * in the cigar string.

    For this case, should I remove the alignments with an unknown cigar string? if that's a good solution, how do I do it?

    2- Using picard tools SortSam.jar

    java -jar /share/apps/picard-tools_current/jar/SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT INPUT=0099758_1.fastqnovo.raw.sam OUTPUT=0099758_1.bam

    output:

    [Wed Jun 08 14:17:31 BST 2011] net.sf.picard.sam.SortSam INPUT=0099758_1.fastqnovo.raw.sam OUTPUT=0099758_1.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=/tmp/zibrahimbrc VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
    [Wed Jun 08 14:17:31 BST 2011] net.sf.picard.sam.SortSam done.
    Runtime.totalMemory()=2058027008
    Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; Line 163
    Line: @SQ SN:1 AS:human_g1k_v37 LN:249250621
    at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(SAMTextReader.java:213)
    at net.sf.samtools.SAMTextReader.access$400(SAMTextReader.java:40)
    at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:305)
    at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268)
    at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:609)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:587)
    at net.sf.picard.sam.SortSam.doWork(SortSam.java:58)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:117)
    at net.sf.picard.sam.SortSam.main(SortSam.java:66)

    so I am guessing that it didn't like my SQ field. The thing is, I didn't generate it, it was generated automatically by novoalign (what I used for the alignment). Should I specify @SQ fields when running novoalign (and run it again)? if so, what should I put it in it?

    Here's how I ran my novoalign:

    novoalign -d /scratch/data/reference_genomes/human/human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM $"@RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1" 2> 0099758_1.fastqnovo.stats > 0099758_1.fastqnovo.raw.sam

    basically specifying the RG field only in the header.



    -----------------------------------------


    Also, this is the output of "samtools view -SH GSM520_ES.aligned.sam

    samopen] SAM header is present: 84 sequences.
    @HD VN:1.0 SO:unsorted
    @RG ID:0099758_1.fastq PL:sanger LB:epilepsy SM:0099758_1
    @PG ID:novoalign VN:V2.07.09 CL:novoalign -d human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM @RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1
    @HD VN:1.0 SO:unsorted
    @RG ID:0099758_1.fastq PL:sanger LB:epilepsy SM:0099758_1
    @PG ID:novoalign VN:V2.07.09 CL:novoalign -d human_g1k_v37 -F STDFQ -f 0099758_1_1.fastq 0099758_1_2.fastq -r Random -K 0099758_1 -o SAM @RG\tID:0099758_1.fastq\tPL:sanger\tLB:epilepsy\tSM:0099758_1
    @SQ SN:1 AS:human_g1k_v37 LN:249250621
    @SQ SN:2 AS:human_g1k_v37 LN:243199373
    @SQ SN:3 AS:human_g1k_v37 LN:198022430
    @SQ SN:4 AS:human_g1k_v37 LN:191154276
    @SQ SN:5 AS:human_g1k_v37 LN:180915260
    @SQ SN:6 AS:human_g1k_v37 LN:171115067
    @SQ SN:7 AS:human_g1k_v37 LN:159138663
    @SQ SN:8 AS:human_g1k_v37 LN:146364022
    @SQ SN:9 AS:human_g1k_v37 LN:141213431
    @SQ SN:10 AS:human_g1k_v37 LN:135534747
    @SQ SN:11 AS:human_g1k_v37 LN:135006516
    @SQ SN:12 AS:human_g1k_v37 LN:133851895
    @SQ SN:13 AS:human_g1k_v37 LN:115169878
    @SQ SN:14 AS:human_g1k_v37 LN:107349540
    @SQ SN:15 AS:human_g1k_v37 LN:102531392
    @SQ SN:16 AS:human_g1k_v37 LN:90354753
    @SQ SN:17 AS:human_g1k_v37 LN:81195210
    @SQ SN:18 AS:human_g1k_v37 LN:78077248
    @SQ SN:19 AS:human_g1k_v37 LN:59128983
    @SQ SN:20 AS:human_g1k_v37 LN:63025520
    @SQ SN:21 AS:human_g1k_v37 LN:48129895
    @SQ SN:22 AS:human_g1k_v37 LN:51304566
    @SQ SN:X AS:human_g1k_v37 LN:155270560
    @SQ SN:Y AS:human_g1k_v37 LN:59373566
    @SQ SN:MT AS:human_g1k_v37 LN:16569
    @SQ SN:GL000207.1 AS:human_g1k_v37 LN:4262
    @SQ SN:GL000226.1 AS:human_g1k_v37 LN:15008
    @SQ SN:GL000229.1 AS:human_g1k_v37 LN:19913
    @SQ SN:GL000231.1 AS:human_g1k_v37 LN:27386
    @SQ SN:GL000210.1 AS:human_g1k_v37 LN:27682



    ------------------------------------------------

    Any help would be appreciated . Thank you!

  • #2
    Hi zhacker,

    I have a question about how you ran "samtools view" . . .

    command: samtools view -bS $REF mysamfile.sam > bamfile.bam

    output:
    samopen] SAM header is present: 84 sequences.
    [sam_read1] reference 'AS:human_g1k_v37' is recognized as '*'.
    Parse error at line 163: invalid CIGAR character
    I don't think you need to include $REF in the command (assuming that is an alias to your reference file). It looks like your SAM file already includes the sequence names in the header. Including a file of the reference names would only be needed if there is no SAM header.

    Also, for some reason it is thinking "AS:human_g1k_v37" is the name of one of the references. It sees that this name is not in your reference list and therefore puts it as a "*" (unknown). So something is odd about that too. Maybe try running the "samtools view" command without $REF and see if that works. Also, maybe post what line 163 is since that seems to be giving samtools and Picard trouble.

    good luck,
    BAMseek

    Comment


    • #3
      Thank you BAMseek..

      I ran samtools view without $REF and it gave me the same error.

      As for line 163, it's another @SQ field :

      163 @SQ SN:1 AS:human_g1k_v37 LN:249250621


      But what's interesting about it is that it's the first @SQ line that comes after the first alignment section:


      159 1:1:1191:18922:Y 89 11 31320263 119 36M = 31320263 0 CTCAGTCTACTTCAAAAAAGCCAATGTCTGTCACAA ??C5=BA=>?C5>AA>??=:A::CDDDD:CC?AC PG:Z:novoalign RG:Z:0099758_1.fastq LB:Z:epilepsy AS:i:0 UQ:i:0 NM:i:0 MD:Z:36
      160 1:1:1191:18922:Y 165 11 31320263 0 * = 31320263 0 AGGTGGTGACTGGACNCTNGNNAGNNNCTAAGNTTN 6@>>@-A:5A>BB####################### PG:Z:novoalign RG:Z:0099758_1.fastq LB:Z:epilepsy ZS:Z:QC
      161 1:1:1192:5316:Y 73 11 31439373 119 36M = 31439373 0 ACAGCCATATATTTATCTAAAAGATAATTTAAAGGG FF:B?FFBDFDDDFEFFFDFEDE>>DEEEB=:EEEA PG:Z:novoalign RG:Z:0099758_1.fastq LB:Z:epilepsy AS:i:0 UQ:i:0 NM:i:0 MD:Z:36
      162 1:1:1192:5316:Y 133 11 31439373 0 * = 31439373 0 TCCAAAGTGCTACCTNAANANNAGNNNAATCTNCAN F=EDFBEEE:C@@>@##################### PG:Z:novoalign RG:Z:0099758_1.fastq LB:Z:epilepsy ZS:Z:QC

      163 @SQ SN:1 AS:human_g1k_v37 LN:249250621
      164 @SQ SN:2 AS:human_g1k_v37 LN:243199373
      165 @SQ SN:3 AS:human_g1k_v37 LN:198022430
      166 @SQ SN:4 AS:human_g1k_v37 LN:191154276
      167 @SQ SN:5 AS:human_g1k_v37 LN:180915260
      168 @SQ SN:6 AS:human_g1k_v37 LN:171115067
      169 @SQ SN:7 AS:human_g1k_v37 LN:159138663
      170 @SQ SN:8 AS:human_g1k_v37 LN:146364022
      171 @SQ SN:9 AS:human_g1k_v37 LN:141213431
      172 @SQ SN:10 AS:human_g1k_v37 LN:135534747

      Pardon me if this sounds stupid (this is my first time looking at sam files ) but my .sam file looks like:

      header
      sq fields
      alignment section
      sq fields
      alignment section
      .....


      line 163 is where the second batch of alignment (2nd alignment section) begins. Is there a mismatch between the number of alignments and @SQ fields in the first batch? I don't know, is this even possible? Thank you again.....

      Comment


      • #4
        Now that I think about it, it really seems that instead of an SQ line, it was waiting for an 'alignment' line
        ...now, do I fix that by running novoalign again and instead of letting it generate the SQ
        field I do it myself (if that's even an option)? or can I just remove the SQ fields from my sam file and run samtools with the -bt option instead? thank you all in advance.

        Comment


        • #5
          Yeah, that is weird. I think the header lines (those beginning with @) should all appear at the top, but for some reason you have header lines after the alignments start. I would think the programs you are using are trying to parse the header line as an alignment line and it is complaining (rightfully) that it isn't in the right format. If samtools and Picard both complain, I would think there is something wrong with the files - are they somehow multiple sam files concatenated together?

          Comment


          • #6
            yes I am realigning now to see if I get a proper sam file. I am using novoalign on paired-end data...I think what happened is that the aligner processed the files sequentially and placed its results in the file (very strange). the paired-end option was default so I didn't include it in the run, this time I made it explicit (hoping that this would fix the issue)..I'll let you know the outcome when the alignment's done! Thank you very much.

            Comment


            • #7
              Hi zhacker,

              Were you able to figure things out? I haven't had any experience with Novoalign so probably wouldn't be much help there. It seems unusual that they would place the output in a non-standard way, since downstream tools would rely on the files to be in a specific form. If they did indeed just concatenate all the files together, then you could probably create a script to break them back up - in case that is easier than doing the alignment again.

              Comment


              • #8
                Hi BAMSeek,

                Thank you for the follow-up. I am re-aligning one of my .sam files just to see if it will align it properly this time and have written a script to reorganize the file. The thing is, I have paired-end data and novoalign assumes that by default if you don't tell it otherwise. So in my first attempt for alignment, I left that option out (thinking that it will use the default value...this time I'm putting it explicitly. I will be working on this on the weekend and will post my findings here in case they're of use to others.

                Comment


                • #9
                  Sam file with @PG

                  Hi,
                  I am trying to convert sam files with a header @PG
                  I tried this:
                  samtools view -bStH myfile.sam

                  Got the following error:

                  [main_samview] fail to open "myfile.sam" for reading.

                  Any ideas what the problem could be?

                  Comment


                  • #10
                    Hi SDBP,

                    Did you figure out the answer to your question yet?

                    [main_samview] fail to open "myfile.sam" for reading.
                    It sounds like samtools cannot find the file. Try specifying the entire path to the sam file or make sure you are in the folder that contains the file when you execute the command.

                    Comment


                    • #11
                      I did fix the segmentation fault and tried to do the SAM to BAM conversion again.
                      It ran but showed the following on the screen:
                      [sam_read1] reference 'chr11.fa' is recognized as '*'.

                      After learning from another thread on this forum, I did the following:
                      sed s/.fa// on the input file before doing export2sam.pl

                      now when I run export2sam.pl, I get this:

                      ERROR: Unexpected number of fields in export record on line 285 of read1 export file. Found 21 fields but expected 22.
                      ...erroneous export record:
                      ABC-DE2 1 4 1 3 1347 0 1 TTTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB QC

                      Any insight would help.

                      Comment


                      • #12
                        Hi SDBP,

                        I have seen this issue too. The ELAND file should have 22 tab-delimited fields, but if the last field or so is empty, then there might only be 21 fields. Not sure if this happens as output from ELAND or if something gets lost when the file is copied and pasted around. Maybe try modifying the converter script to check to see if there are at least 21 fields instead of requiring 22. Hope that helps.

                        BAMSeek

                        Comment


                        • #13
                          Hello,

                          I am having a similar issue where using this samtools command:
                          samtools view -bS -t abyss.fa.fai /data/storage_fast/J_WD/bowtie/abyss_nosq.sam nosq.bam

                          I get this error: [main_samview] fail to open "/data/storage_fast/J_WD/bowtie/abyss_nosq.sam" for reading.

                          I tried BAMseek's suggestion however it did not solve the problem. Does any one else have a suggestion?

                          Thank you in advance.

                          Originally posted by BAMseek View Post
                          Hi SDBP,

                          Did you figure out the answer to your question yet?



                          It sounds like samtools cannot find the file. Try specifying the entire path to the sam file or make sure you are in the folder that contains the file when you execute the command.
                          jdjax
                          Ph.d. Student
                          Åarhus University

                          Comment


                          • #14
                            Hi everyone and sorry for the late reply. My problem was with Novoalign, I was using the mpi-version which really ruined the generated sam files. Their header and alignment sections were allover the file. I redid the alignment without using mpi..it was much slower but the generated sam files were intact. I'm guessing that not too many people use novoailgn because of the license, just thought I'd let you know.

                            Comment


                            • #15
                              you forgot to put in the filename using the -o command line argument. Just adding the output file name after command wouldn't work.
                              It might work with changing the output using the stream operator > but not sure about that

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X