  • natpokah
    Junior Member
    • May 2010
    • 5

    bwa sampe very slow

    We are trying to run sampe on sai files generated from FASTQ files obtained with a HiSeq 2000. There are about 100 million paired-end reads of 104 bases.
    We have estimated the insert size at about 200 to 500, depending on the library.

    We are running the following command line:
    bwa sampe -P -a 300 (or -a 600). We also tried -o values of 1, 100 or 100000.
    But bwa runs really slowly. So far it has run for 4 days and generated files of about 1Gb. bwa is installed on an HPC system that should have the CPU and RAM it needs.
    Any idea what is causing the slow behavior?
  • dp05yk
    Member
    • Dec 2010
    • 66

    #2
    Unfortunately BWA will ignore your -a specification if it believes it can estimate a better insert size. Could you post some of the stderr output? My guess is there's a huge insert size estimate being used which will lead to a long lag during "align unmapped mates".

    Try the -A (capital A) parameter. This disables Smith-Waterman mate rescuing and will likely speed up your sampe run.

    • natpokah
      Junior Member
      • May 2010
      • 5

      #3
      Hi there!
      Thanks for a quick reply!
      You are totally right, the insert size estimate is huge and then sampe gets stuck for hours trying to "align unmapped mate".
      I will follow your advice and try the -A option.

      Thanks!

      • dp05yk
        Member
        • Dec 2010
        • 66

        #4
        Yeah, it's a weird feature. Using the -a parameter, it should be clear that you want to override the program's estimating process, but for some reason the -a parameter becomes a fallback value.

        • natpokah
          Junior Member
          • May 2010
          • 5

          #5
          Hi again!
          So, -A worked wonders! Thank you!
          Would the fact of sorting the fastq by ID name help speed up even more?
          Also, since I filter reads that have too many "N" and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?
          Thanks

          • dp05yk
            Member
            • Dec 2010
            • 66

            #6
            Would the fact of sorting the fastq by ID name help speed up even more?
            As long as the nth reads in the two mate files are actually mates for all n, you're fine.

            Also, since I filter reads that have too many "N" and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?
            This is absolutely your problem. If you filter one read that has too many N's, you _need_ to also remove its mate-pair.

            BWA assumes your two mate files have an equal number of reads and that read N in one file is the mate of read N in the other file. If these files are not set up this way, it's no wonder you're getting such huge insert size estimates.

            I'd be wary of preprocessing your FASTQ files for the above reasons - additionally, BWA will not be affected by many Ns in your input reads since Ns are treated as mismatches and these reads will be quickly thrown out anyways.
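
If you do filter anyway, the point above is that both mates have to go together. A minimal sketch of pair-aware filtering, assuming standard 4-line FASTQ records; the function names and the `max_n` threshold are hypothetical and not part of BWA:

```python
from itertools import zip_longest


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line is not needed
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def filter_pairs(lines1, lines2, max_n=5):
    """Keep a pair only if BOTH mates have at most max_n 'N' bases.

    max_n is an illustrative threshold, not a recommended value.
    """
    kept1, kept2 = [], []
    for r1, r2 in zip_longest(read_fastq(lines1), read_fastq(lines2)):
        if r1 is None or r2 is None:
            raise ValueError("mate files have different numbers of reads")
        ok1 = r1[1].upper().count("N") <= max_n
        ok2 = r2[1].upper().count("N") <= max_n
        if ok1 and ok2:
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2
```

Because a pair is dropped as a unit, the two outputs always have the same length and the nth records stay mates. Open file handles can be passed directly as `lines1` and `lines2`.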

            • natpokah
              Junior Member
              • May 2010
              • 5

              #7
              Yeah, the only reason to remove the reads with too many N before aligning was for our statistics down the line. We wanted to have a meaningful % of unique matches. And since the fastq sometimes comes with a lot of junk, calculating the % based on the whole set of reads would artificially decrease the % of unique matches.
              We basically wanted to start "fresh" with "mappable" reads.
              This is going to be useful if we run samse on each end individually. But you are right, I will start from the raw fastq, remove my adapters and make sure I have the same number of reads in both fastq files.
              Thank you so much for your precious help!

              • dp05yk
                Member
                • Dec 2010
                • 66

                #8
                Yes - for samse, it's not really a big deal, hack up your FASTQs! The problems with FASTQ file mods are when you're pairing, because BWA will take each read from each file in order, and assume that each subsequent read from file one is a mate with the same subsequent read from file two.
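
Since the pairing is purely positional, it can be worth checking the two files before running sampe. A rough sketch, assuming 4-line FASTQ records and read names that end in /1 and /2 (other naming schemes would need a different `mate_name`; all names here are hypothetical helpers, not BWA functions):

```python
def mate_name(header):
    """Drop the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    name = header.strip()[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def first_mismatch(headers1, headers2):
    """Return the 0-based index of the first position where the two files
    disagree, or None if they are in sync."""
    h1, h2 = list(headers1), list(headers2)
    for i, (a, b) in enumerate(zip(h1, h2)):
        if mate_name(a) != mate_name(b):
            return i
    if len(h1) != len(h2):
        # One file simply has extra reads at the end.
        return min(len(h1), len(h2))
    return None


def fastq_headers(path):
    """Yield the header line of every 4-line record in a FASTQ file."""
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0 and line.strip():
                yield line
```

Usage would be something like `first_mismatch(fastq_headers("reads_1.fastq"), fastq_headers("reads_2.fastq"))`, where the file names are placeholders. Note that `first_mismatch` loads both header lists into memory, which may be a lot for 100 million reads.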

                • cbeck
                  Junior Member
                  • Sep 2011
                  • 6

                  #9
                  Heyo,
                  I guess I am getting the same issues as Nat was on our Illumina indexed runs -

                  [bwa_sai2sam_pe_core] convert to sequence coordinate...
                  [infer_isize] (25, 50, 75) percentile: (10801, 27183, 54186)
                  [infer_isize] low and high boundaries: 76 and 140956 for estimating avg and std
                  [infer_isize] inferred external isize from 1185 pairs: 34258.846 +/- 27441.256
                  [infer_isize] skewness: 0.654; kurtosis: -0.699; ap_prior: 1.00e-05
                  [infer_isize] inferred maximum insert size: 200827 (6.07 sigma)

                  Those are some huge inserts. So since I am not doing any filtering of reads at this point - do you think my reads might be getting out-of-sync because they are demultiplexed? The only thing I am doing to the qseq files is deindexing, running qseq2fastq.pl and then running bwa aln. I'm using -A as a test right now to see if it brings things down from the 5 days or so each indexed sample was taking.
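
For anyone wanting to catch this automatically, here is a small sketch that pulls the inferred insert size out of stderr text in the format pasted above. The regular expression only targets that exact log line, other BWA versions may print it differently, and the `expected_max` cutoff is a made-up value to adjust for your libraries:

```python
import re

# Matches the "[infer_isize] inferred external isize from N pairs: mean +/- sd" line
# exactly as it appears in the log pasted in this post.
ISIZE_RE = re.compile(
    r"\[infer_isize\] inferred external isize from (\d+) pairs: "
    r"([\d.]+) \+/- ([\d.]+)"
)


def parse_isize(stderr_text):
    """Return (n_pairs, mean, sd) from the first matching line, or None."""
    m = ISIZE_RE.search(stderr_text)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), float(m.group(3))


def looks_suspicious(mean, expected_max=1000):
    """Flag an inferred mean insert size above expected_max.

    expected_max is a hypothetical cutoff, not a BWA parameter.
    """
    return mean > expected_max
```

With the log above, the inferred mean of about 34258 would be flagged against a library expected to be a few hundred bases.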

                  • rskr
                    Senior Member
                    • Oct 2010
                    • 249

                    #10
                    BWA isn't name aware. If the reads are out of parity during bwa sampe, it will try to model two partners that aren't proper pairs and most likely aren't in the same neighborhood in the correct orientation, so it falls back to the Smith-Waterman local search, which is computationally very expensive.

                    • cbeck
                      Junior Member
                      • Sep 2011
                      • 6

                      #11
                      Hi rskr,
                      Can you think of a reason why the partners wouldn't be pairs?

                      • cbeck
                        Junior Member
                        • Sep 2011
                        • 6

                        #12
                        Err, rather, since they aren't and since it is Illumina's fault, would you recommend running Picard's FastqToSam to sort and then SamToFastq?

                        • rskr
                          Senior Member
                          • Oct 2010
                          • 249

                          #13
                          Originally posted by cbeck
                          Hi rskr,
                          Can you think of a reason why the partners wouldn't be pairs?
                          I have seen it when one read of a pair was quality filtered but the other was not. The missing read then gets replaced with whatever was next in the file, so the files no longer match.

                          1.1 1.2
                          2.1 2.2
                          3.1 3.2
                          4.1 5.2 <--4.2 was omitted, they are no longer in parity.
                          5.1 6.2
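
One way to repair files like the ones in this illustration is to keep only reads whose mate is still present. A minimal sketch, assuming records are already parsed into (header, sequence, quality) tuples and that names end in /1 and /2; `mate_name` and `resync` are hypothetical helpers, and this approach holds the whole second file in memory:

```python
def mate_name(header):
    """Drop the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    name = header.strip()[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def resync(records1, records2):
    """Return two lists of equal length where the nth records are mates.

    Reads without a partner in the other file are dropped; output follows
    the order of records1.
    """
    by_name2 = {mate_name(r[0]): r for r in records2}
    out1, out2 = [], []
    for r in records1:
        mate = by_name2.get(mate_name(r[0]))
        if mate is not None:
            out1.append(r)
            out2.append(mate)
    return out1, out2
```

Applied to the example above, read 4.1 (whose mate was omitted) and read 6.2 (whose mate is not shown) would be dropped, leaving pairs 1, 2, 3 and 5 back in parity.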

                          • naxin
                            Junior Member
                            • Jul 2010
                            • 6

                            #14
                            Could you tell me how to use the -A option? It says: invalid option -- 'A'. Many thanks.

                            • swbarnes2
                              Senior Member
                              • May 2008
                              • 910

                              #15
                              The other possibility: double-check that your bwa aln command line was right. If you accidentally put a typo in one of your fastq names and one fastq doesn't actually get aligned, sampe proceeds anyway and returns crazy large insert sizes. So try running samse on each of your individual fastqs. You want to know that they are working, and you want to know if the two files are in sync.
