Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem running BWA sampe

    Maybe someone can see what I'm doing wrong here. I'm using BWA to align a dataset and get a SAM file that I can use for variant calling. I've indexed the reference fasta file with 'bwa index' and run 'bwa aln' to generate my .sai files. These steps seem to have worked, but when I try to use 'bwa sampe' to produce a .sam file, the process aborts and I get the following error message:

    '[bns_restore_core] fail to open file 'human_g1k_v37.fasta.ann'. Abort!'

    The command I'm using is:

    'bwa sampe -a 500 -o 100000 -n 3 -N 10 -P human_g1k_v37.fasta ERR035484_1.sai ERR035484_2.sai ERR035484_1.fastq ERR035484_2.fastq > PEA1.sam'

    I have a .ann file ('human_g1k_v37.ann') from the indexing step, and I thought it might be a filename problem, i.e. the program trying to open a .fasta.ann file instead of just .ann, but omitting the .fasta in the command still gave a similar error message.

    The first few lines of my .ann file look like this:

    3101804739 84 11
    0 1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
    0 249250621 39
    0 2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
    249250621 243199373 24
    0 3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
    492449994 198022430 9
    0 4 dna:chromosome chromosome:GRCh37:4:1:191154276:1
    690472424 191154276 12
    0 5 dna:chromosome chromosome:GRCh37:5:1:180915260:1

    I've searched the forum but not found a solution to this that works for me. Can anyone help me out?

  • #2
    The indexing step should have generated the file human_g1k_v37.fasta.ann instead of human_g1k_v37.ann . Renaming it to human_g1k_v37.fasta.ann (and doing the same for any of the other index files that are mysteriously misnamed) should fix the problem.

    Comment


    • #3
      Thanks Rocketnight, but I still get the error:

      [bam_header_read] invalid BAM binary header (this is not a BAM file).
      [bam_header_read] invalid BAM binary header (this is not a BAM file).
      [bns_restore_core] fail to open file 'human_g1k_v37.fasta.nt.ann'. Abort!

      The reads are from an Illumina GAIIx, so everything should be base space, and from what I've read, the '.nt' refers to colour space data.

      My alignment command was 'bwa aln -n 3 -o 1 -e -1 -d 16 -l 5 -k 2 - l 14 -t1 -M 3 -O 11 -E 4 -q 0' if that's any help. I'm using bwa/0.6.1.

      Any more suggestions would be greatly appreciated.

      Comment


      • #4
        Most of those command-line options are unneccessary - you're just setting values to their default values, which doesn't change much. Those options are only there to be tinkered with if you really know what you're doing, so the only thing you really need to run bwa aln is:

        $ bwa aln [genome.fasta] [reads.fq] > [output.sai]

        Secondly, bwa seems to be freaking out attempting to read a BAM file. Are you supplying reads in FASTQ format [.fastq(.gz) or .fq(.gz)] or .BAM format?

        Comment


        • #5
          The reads going into BWA are in fastq format, converted from the Sequence Read Archive's .sra format using the SRA Toolkit. I was puzzled by the BAM error too, but I can't see anything in what I've done so far that would lead it to expect a BAM file.

          Comment


          • #6
            Bump for any more input on what might be stopping bwa sampe from running. Any ideas?

            Why would BWA suddenly expect to see a BAM file when everything was supplied in fastq?

            Comment


            • #7
              Can you give me the first few lines of the FASTQ input file? (If it's gzipped, use gunzip -c [file] | head -n 8)

              Comment


              • #8
                Originally posted by Rocketknight View Post
                Can you give me the first few lines of the FASTQ input file? (If it's gzipped, use gunzip -c [file] | head -n 8)
                The first 8 lines are:

                @ERR035484.1 CRIRUN_408:3:9:10903:5069 length=72
                CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCTAACCCTAACCGCACCC
                +ERR035484.1 CRIRUN_408:3:9:10903:5069 length=72
                ;B;CD2@F??=BBCCFEFFFC?E4FF<FBFFG>AFDA9D>F::E@E9B2;>0?###################
                @ERR035484.2 CRIRUN_408:3:40:4508:13433 length=72
                CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
                +ERR035484.2 CRIRUN_408:3:40:4508:13433 length=72
                BB@GHIIHFHIIIIDHIIIIEGIIIICIIIIIGIIIIIFIIIIIHIIIGFEEIIIBEGIEEGEBEAD@>?;A

                Thanks for continuing to try to help, Rocketknight, it's appreciated.

                Comment


                • #9
                  Also, the first few lines of the fasta reference file are:

                  >1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
                  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

                  (The N's give way to actual sequence soon after this.)

                  Comment


                  • #10
                    That looks fine, I can't see BWA mistaking it for a BAM file or anything. BWA is a relatively straightforward program, so I've no idea what could be causing this. All I can think of at this point is a complete clean run with the absolute minimum of command line options, in case some upstream step went wrong somehow.

                    Remove all the indexes and .sai files you've created. The only files you should have left are your two FASTQ read files and the lone FASTA genome file (they can be gzipped, it makes no difference). Index the genome by cding to the directory it's in and just typing "bwa index [genome file]". Align the two FASTQ files with "bwa aln [genome file] [fastq file] > [output file]", then do "bwa sampe [genome file] [sai file 1] [sai file 2] [fastq file 1] [fastq file 2] > [output.sam]". Don't add any other command line options. Paste the output from each step into a text file and post it up here when you're done (there shouldn't be any sensitive information in it).

                    With any luck it should either work outright, or at least create an error that'll give me some idea of what's going wrong.

                    Comment


                    • #11
                      Is that a space in the chromosome name of the fasta? That might be causing problems.

                      Comment


                      • #12
                        I'm still getting the same problem after re-running the index and align steps. The error message (exit code 134) gave the following as the output:

                        [bam_header_read] invalid BAM binary header (this is not a BAM file).
                        [bam_header_read] invalid BAM binary header (this is not a BAM file).
                        [bns_restore_core] fail to open file 'human_g1k_v37.fasta.nt.ann'. Abort!
                        /netscr/lsf/killdevil/lsbatch/1334055583.424139: line 8: 15693 Aborted (core dumped) bwa sampe human_g1k_v37.fasta ERR035484_1.sai ERR035484_2.sai ERR035484_1.fastq ERR035484_2.fastq

                        The alignment files were produced with the proper filenames/extensions (e.g. human_g1k_v37.fasta.ann). It still seems to think it's looking for a BAM file.

                        Any thoughts? I'm thinking of maybe trying a different reference sequence, but I don't really have a reason to suspect a problem with the one from 1000 Genomes.

                        swbarnes2 - I only see two spaces in the description line. Which one do you think might cause a problem? I can get rid of it with a Perl script if necessary.

                        Comment


                        • #13
                          I was able to reproduce your error: I just put in a wrong prefix index and wrong sai files. It seems bwa uses sam bam_reader functions to read the native .sai format (that's why the error appears twice). So: Do the indexing again using the -p (for Prefix) option and use this prefix (exactly this prefix) for the aln and sampe commands.

                          And probably there is something wrong with the .sai files (that may be because of the prefix issue). Did you change bwa versions between alignment and sampe step?

                          Hope that helps,
                          Peter

                          Comment


                          • #14
                            Ok. I've tried repeating the index and alignment steps and I think Peter is right about there being a problem with the .sai files. The alignment step reports that it runs successfully but, when I check the contents of the .sai files, they seem to contain only details about the lsf job queueing e.g. 'Job <533216> is submitted to queue <day>'. What could be causing this? I've been over everything I can think of, and repeated the steps multiple times (even trying earlier versions of BWA, but not mixing versions). Any more ideas after seeing the .sai output?

                            Comment


                            • #15
                              use option -f

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              27 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X