  • natpokah
    Junior Member
    • May 2010
    • 5

    bwa sampe very slow

    We are trying to run sampe on .sai files generated from FASTQ files obtained with a HiSeq 2000. There are about 100 million paired-end reads of 104 bases each.
    We have estimated the insert size at about 200 to 500, depending on the library.

    We are running the following command line:
    bwa sampe -P -a 300 (or -a 600). We also tried -o values of 1, 100 or 100000.
    But bwa runs really slowly. So far it has run for 4 days and generated files of about 1 GB. bwa is installed on an HPC system that should have the CPU and RAM it needs.
    Any idea what is causing the slow behavior?
  • dp05yk
    Member
    • Dec 2010
    • 66

    #2
    Unfortunately BWA will ignore your -a specification if it believes it can estimate a better insert size. Could you post some of the stderr output? My guess is there's a huge insert size estimate being used which will lead to a long lag during "align unmapped mates".

    Try the -A (capital A) parameter. This disables Smith-Waterman mate rescuing and will likely speed up your sampe run.

    • natpokah
      Junior Member
      • May 2010
      • 5

      #3
      Hi there!
      Thanks for a quick reply!
      You are totally right, the insert size estimate is huge and then sampe gets stuck for hours trying to "align unmapped mates".
      I will follow your advice and try the -A option.

      Thanks!

      • dp05yk
        Member
        • Dec 2010
        • 66

        #4
        Yeah, it's a weird feature. When you pass -a, it should be clear that you want to override the program's own estimate, but for some reason the -a value is only used as a fallback.

        • natpokah
          Junior Member
          • May 2010
          • 5

          #5
          Hi again!
          So, -A worked wonders! Thank you!
          Would sorting the fastq by read ID help speed things up even more?
          Also, since I filter out reads that have too many "N"s and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?
          Thanks

          • dp05yk
            Member
            • Dec 2010
            • 66

            #6
            Would sorting the fastq by read ID help speed things up even more?
            As long as the nth reads in the two mate files are actually mates for all n, you're fine.

            Also, since I filter out reads that have too many "N"s and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?
            This is absolutely your problem. If you filter out one read that has too many Ns, you _need_ to also remove its mate.

            BWA assumes your two mate files have an equal number of reads and that read N in one file is the mate of read N in the other file. If the files are not set up this way, it's no wonder you're getting such huge insert size estimates.
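The assumption described above can be checked before running sampe. The following is a minimal sketch, not part of BWA: it assumes plain 4-line FASTQ records (no wrapped sequences, no blank lines) and mate names that match once an optional trailing /1 or /2 is removed. The function names and the toy records are hypothetical.

```python
import itertools

def read_names(lines):
    """Yield the read name from each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of a record
            name = line.strip().split()[0]
            if not name.startswith("@"):
                raise ValueError(f"record {i // 4 + 1}: header does not start with '@'")
            name = name[1:]
            # Drop a trailing /1 or /2 mate suffix if present (assumed naming scheme).
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name

def first_desync(lines1, lines2):
    """Return the 1-based index of the first record whose names differ
    (or where one input ends early), or None if the inputs are in sync."""
    pairs = itertools.zip_longest(read_names(lines1), read_names(lines2))
    for n, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return n
    return None

# Toy data: the second record of the two "files" is not a mate pair.
fq1 = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "ACGT", "+", "IIII"]
fq2 = ["@r1/2", "ACGT", "+", "IIII", "@r3/2", "ACGT", "+", "IIII"]
print(first_desync(fq1, fq2))  # 2
```

Both functions accept any iterable of lines, so two open file handles can be passed in place of the lists, and the files are then streamed rather than loaded into memory.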

            I'd be wary of preprocessing your FASTQ files for the above reasons. Additionally, BWA will not be affected by many Ns in your input reads, since Ns are treated as mismatches and these reads will be quickly thrown out anyway.

            • natpokah
              Junior Member
              • May 2010
              • 5

              #7
              Yeah, the only reason to remove the reads with too many Ns before aligning was for our statistics down the line. We wanted a meaningful % of unique matches. Since the fastq sometimes comes with a lot of junk, calculating the % over the whole set of reads would artificially decrease the % unique.
              We basically wanted to start "fresh" with "mappable" reads.
              This is going to be useful if we run samse on each end individually. But you are right: I will start from the raw fastq, remove my adapters and make sure I have the same number of reads in both fastq files.
              Thank you so much for your precious help!

              • dp05yk
                Member
                • Dec 2010
                • 66

                #8
                Yes - for samse, it's not really a big deal, hack up your FASTQs! The problems with FASTQ file mods are when you're pairing, because BWA will take each read from each file in order, and assume that each subsequent read from file one is a mate with the same subsequent read from file two.
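If reads must be filtered before pairing, one way to keep the two files in step is to decide per pair rather than per read. This is a rough sketch under stated assumptions: records are already parsed into (name, sequence) tuples in matching order, and the `max_n` threshold is a hypothetical parameter, not a BWA setting.

```python
def too_many_n(seq, max_n):
    """True if the sequence contains more than max_n ambiguous bases."""
    return seq.upper().count("N") > max_n

def filter_pairs(mates1, mates2, max_n=2):
    """Drop a pair entirely if EITHER mate fails the N filter,
    so record i in one output is still the mate of record i in the other."""
    if len(mates1) != len(mates2):
        raise ValueError("mate lists differ in length; they are already out of sync")
    kept1, kept2 = [], []
    for r1, r2 in zip(mates1, mates2):
        if too_many_n(r1[1], max_n) or too_many_n(r2[1], max_n):
            continue  # discard both mates together
        kept1.append(r1)
        kept2.append(r2)
    return kept1, kept2

# Toy data: read "b" fails in file 1, so its mate in file 2 is dropped too.
mates1 = [("a", "ACGT"), ("b", "NNNN"), ("c", "ACGN")]
mates2 = [("a", "TTTT"), ("b", "ACGT"), ("c", "GGGG")]
kept1, kept2 = filter_pairs(mates1, mates2)
print([r[0] for r in kept1], [r[0] for r in kept2])  # ['a', 'c'] ['a', 'c']
```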

                • cbeck
                  Junior Member
                  • Sep 2011
                  • 6

                  #9
                  Heyo,
                  I guess I am getting the same issues as Nat was on our Illumina indexed runs -

                  [bwa_sai2sam_pe_core] convert to sequence coordinate...
                  [infer_isize] (25, 50, 75) percentile: (10801, 27183, 54186)
                  [infer_isize] low and high boundaries: 76 and 140956 for estimating avg and std
                  [infer_isize] inferred external isize from 1185 pairs: 34258.846 +/- 27441.256
                  [infer_isize] skewness: 0.654; kurtosis: -0.699; ap_prior: 1.00e-05
                  [infer_isize] inferred maximum insert size: 200827 (6.07 sigma)

                  Those are some huge inserts. So since I am not doing any filtering of reads at this point - do you think my reads might be getting out-of-sync because they are demultiplexed? The only thing I am doing to the qseq files is deindexing, running qseq2fastq.pl and then running bwa aln. I'm using -A as a test right now to see if it brings things down from the 5 days or so each indexed sample was taking.
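As a sanity check on the log above, the reported numbers can be reproduced with simple arithmetic. The quartile rule and the factor of 2 below are inferred from the figures in this log only, not taken from BWA's documentation, so treat them as an assumption.

```python
# Values copied from the [infer_isize] lines in the log.
p25, p50, p75 = 10801, 27183, 54186
mean, sd = 34258.846, 27441.256
max_isize = 200827

iqr = p75 - p25        # interquartile range: 43385
high = p75 + 2 * iqr   # matches the logged high boundary
low = p25 - 2 * iqr    # negative here; the log reports 76 instead,
                       # presumably a floor such as the read length (a guess)

# How many standard deviations above the mean the logged maximum sits.
sigmas = (max_isize - mean) / sd

print(high)              # 140956
print(low)               # -75969
print(round(sigmas, 2))  # 6.07
```

The quartiles themselves are in the tens of kilobases, so the estimate is not being pulled up by a handful of outliers; most of the 1185 pairs used for the estimate already look far apart.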

                  • rskr
                    Senior Member
                    • Oct 2010
                    • 249

                    #10
                    BWA isn't name-aware, so if the reads are out of parity during bwa sampe, it will try to model two partners that aren't a proper pair and most likely aren't in the same neighborhood in the correct orientation. It then falls back to the Smith-Waterman local search, which is computationally very expensive.

                    • cbeck
                      Junior Member
                      • Sep 2011
                      • 6

                      #11
                      Hi rskr,
                      Can you think of a reason why the partners wouldn't be pairs?

                      • cbeck
                        Junior Member
                        • Sep 2011
                        • 6

                        #12
                        Err, rather, since they aren't and since it is Illumina's fault, would you recommend running Picard's FastqToSam to sort and then SamToFastq?

                        • rskr
                          Senior Member
                          • Oct 2010
                          • 249

                          #13
                          Originally posted by cbeck
                          Hi rskr,
                          Can you think of a reason why the partners wouldn't be pairs?
                          I have seen it when one read of a pair was quality-filtered but the other was not. The missing read then gets replaced with whatever was next in the file, so the pairs no longer match.

                          1.1 1.2
                          2.1 2.2
                          3.1 3.2
                          4.1 5.2 <--4.2 was omitted, they are no longer in parity.
                          5.1 6.2
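The situation in the diagram above can be repaired by matching records on read name instead of position. This is a minimal sketch, assuming read names are unique and that the second file's records fit in memory (which may not hold for 100 million reads); `resync` and the tuple layout are hypothetical.

```python
def resync(mates1, mates2):
    """Re-pair records by name (first tuple element).
    Returns (paired1, paired2, orphans1, orphans2); paired1[i] and
    paired2[i] always share a name, in file-1 order."""
    lookup2 = {rec[0]: rec for rec in mates2}
    paired1, paired2, orphans1 = [], [], []
    for rec in mates1:
        mate = lookup2.pop(rec[0], None)
        if mate is None:
            orphans1.append(rec)  # mate was filtered out of file 2
        else:
            paired1.append(rec)
            paired2.append(mate)
    orphans2 = list(lookup2.values())  # reads whose mate is missing from file 1
    return paired1, paired2, orphans1, orphans2

# Same layout as the diagram: read 4 is missing from the second file.
file1 = [(str(i),) for i in (1, 2, 3, 4, 5)]
file2 = [(str(i),) for i in (1, 2, 3, 5, 6)]
p1, p2, o1, o2 = resync(file1, file2)
print([r[0] for r in p1])  # ['1', '2', '3', '5']
print([r[0] for r in o1])  # ['4']
print([r[0] for r in o2])  # ['6']
```

The orphaned reads can still be aligned on their own with samse, where pairing order does not matter.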

                          • naxin
                            Junior Member
                            • Jul 2010
                            • 6

                            #14
                            Could you tell me how to use the -A option? It says: invalid option -- 'A'. Many thanks.

                            • swbarnes2
                              Senior Member
                              • May 2008
                              • 910

                              #15
                              The other possibility: double-check that your bwa aln command line was right. If you accidentally put a typo in one of your fastq names and one fastq doesn't actually get aligned, sampe proceeds anyway and returns crazy large insert sizes. So try running samse on each of your individual fastqs. You want to know that they are working, and you want to know whether the two files are in sync.
