Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bwa - sampe - large insert sizes and slow

    Hi, I've been given some BAM files (nothing else) and I'd like to call variants on them. I want to realign the sequences so I've been following the bam aln parameters:
    bwa aln ref.fa -b1 reads.bam > 1.sai
    bwa aln ref.fa -b2 reads.bam > 2.sai
    bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam

    The sampe stage is taking a *very* long time and with massive insert sizes suggesting the ordering is wrong. The bam files are sorted by read names, any suggestions on how I can make sure that 1.sai and 2.sai are ordered properly for the sampe stage to work, thanks.

  • #2
    What are your read lengths? If 75bp+ I'd recommend re-doing the whole thing with BWA-MEM which may fix your sampe problem. Not to mention that BWA-MEM is an improvement in my opinion, based on some of the GCAT benchmarks:

    View Full GCAT Alignment Report: Bwa (100bp-se-small-indel)

    Comment


    • #3
      Hi Oiio,

      Thanks for the comment. I've no idea how long the reads are, just got the data, older project, no other information available. I'll have a closer look at BWA-MEM, cheers.

      Comment


      • #4
        First thing to check, grab the first few pairs of reads, and double-check that they are sorted by name, and that they are orientated they way you expect.
        Usually, huge insert sizes mean that your reads are not paired the way they should be. Maybe you are missing some reads?

        And I know that you got those command lines off of the documentation, but you can also try either splitting the .bams into read 1 and read 2, or converting the .bams to fastqs.

        Comment


        • #5
          Hi swbarnes2,
          Thanks for the comments. Sorry, newbie question, how do I split a bam file into read 1 and read 2?, thanks. I did try the fastq approach originally and sampe was again incredibly slow.

          Comment


          • #6
            Learn what the flag value in a .sam file means. samtools view will help you split a file by what flags it has or doesn't have.

            Comment


            • #7
              Thanks swbarnes2, for the benefit of the thread, this was useful:

              so samtools view -f64 and samtools view -f128 seem to do what I want, thanks.
              The command bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam had been running for days so clearly isn't correct, thanks for the tip of replacing reads.bam with reads1.bam and reads2.bam.

              Comment


              • #8
                Well this obviously still isn't correct as it has taken over 24 hours to create one .sam file, and the insert sizes still look wrong, e.g. 24251. If anyone can enlighten me as to how to realign bam files, I would greatly appreciate it.

                Comment


                • #9
                  Originally posted by Elsie View Post
                  Well this obviously still isn't correct as it has taken over 24 hours to create one .sam file, and the insert sizes still look wrong, e.g. 24251. If anyone can enlighten me as to how to realign bam files, I would greatly appreciate it.
                  If you checked the names of all the reads in read 1, and all the names of the reads in read 2, and you determined that there was a perfect correlation between them, so that at all points, the nth read of the read 1 file is the mate of the nth read in the read 2 file, and samtools is telling you that your insert sizes are that huge, than that's what they are, when aligned to that reference.

                  Comment


                  • #10
                    Hi, thanks again for your help. Nope, the reads in read 1 are not identical to the reads in read2. So there must be reads in there where the first in the pair has been mapped but the second hasn't (and vice versa). So I'll need to be more specific with my sam flags, only pick reads that have both first and second, and then do bwa samse with the unpaired reads treating them as single end and then merge the two bam files. Thanks swbarnes2, I appreciate your help.

                    Comment


                    • #11
                      Hi, I was using the wrong flag, I should have used -F8, the other flags were allowing other reads in. So the reads are now identical but the sampe stage is still taking ages, e.g. "inferred maximum insert size: 12074". Reference is fine (most recent mouse, these are mouse bams), I cannot believe that insert size so I must be missing something in this step:
                      bwa aln ref.fa -b1 reads.bam > 1.sai
                      bwa aln ref.fa -b2 reads.bam > 2.sai

                      Comment


                      • #12
                        So I'm assuming that with this command:
                        bwa aln ref.fa -b1 reads.bam > 1.sai
                        all the first reads in a read pair are used, regardless of whether or not they have a mate that maps, and that this is causing the massive insert sizes, a read without a pair. Any ideas how I can prevent this?

                        Comment


                        • #13
                          I am being a muppet:
                          samtools view -b -F8 bam.bam > bammapped.bam
                          bwa aln ref.fa -b1 bammapped.bam > b1.sai
                          bwa aln ref.fa -b2 bammapped.bam > b2.sai
                          bwa sampe ref.fa b1.sai b2.sai bamampped.bam bammapped.bam > file.sam

                          Comment


                          • #14
                            Hi Elsie,

                            I used to have this problem and I could solve it only after checking some forums so I just added the -A option to the sampe and obviously this will simply stop the insert size estimate. I am not sure on the impact of that on your downstreem steps but so far I have no problem on my own. You can check the bwa sampe options.

                            Comment


                            • #15
                              Originally posted by swbarnes2 View Post
                              If you checked the names of all the reads in read 1, and all the names of the reads in read 2, and you determined that there was a perfect correlation between them, so that at all points, the nth read of the read 1 file is the mate of the nth read in the read 2 file, and samtools is telling you that your insert sizes are that huge, than that's what they are, when aligned to that reference.
                              Why would bwa try to pair two reads that have different names just because they are on the nth read in a list? I'm having the same problem, with my pe read names not sorting to same line. I traced this problem to my quality trimming, but I don't understand why you would throw away a perfectly good read just because its mate didn't pass quality trimming. This is the only way you could have nth reads lining up after trimming.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X