  • bwa sampe very slow

    We are trying to run sampe on sai files generated from fastq files obtained with a HiSeq 2000. There are about 100 million paired-end reads of 104 bases.
    We have estimated the insert size at about 200 to 500, depending on the library.

    We are running the following command line:
    bwa sampe -P -a 300 (or -a 600). We also tried -o values of 1, 100 or 100000.
    But bwa runs really slowly. So far it has run for 4 days and generated files of about 1 GB. bwa is installed on an HPC system that should have the CPU and RAM it needs.
    Any idea what is causing the slow behavior?

  • #2
    Unfortunately BWA will ignore your -a specification if it believes it can estimate a better insert size. Could you post some of the stderr output? My guess is there's a huge insert size estimate being used which will lead to a long lag during "align unmapped mates".

    Try the -A (capital A) parameter. This disables Smith-Waterman mate rescuing and will likely speed up your sampe run.
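
For illustration only, here is a minimal sketch of how such a sampe command line could be assembled. The file names are hypothetical placeholders, and the function only builds the argument list without running bwa. It assumes the usual positional order (index prefix, the two sai files, then the two fastq files) and uses only the -A and -a flags already mentioned in this thread.

```python
def build_sampe_command(ref_prefix, sai1, sai2, fq1, fq2,
                        max_insert=None, use_capital_a=True):
    """Return a bwa sampe argument list; nothing is executed here."""
    cmd = ["bwa", "sampe"]
    if use_capital_a:
        # -A: the option suggested above to avoid the slow mate rescue step.
        cmd.append("-A")
    if max_insert is not None:
        # -a: maximum insert size (300 or 600 in the original question).
        cmd += ["-a", str(max_insert)]
    # Positional arguments: index prefix, sai files, then fastq files.
    cmd += [ref_prefix, sai1, sai2, fq1, fq2]
    return cmd
```

Joining the returned list with spaces gives a line you can paste into a job script; redirect stdout to a SAM file and keep stderr so the insert size estimates can be inspected.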

    • #3
      Hi there!
      Thanks for a quick reply!
      You are totally right, the insert size estimate is huge and then sampe gets stuck for hours trying to "align unmapped mate".
      I will follow your advice and try the -A option.

      Thanks!

      • #4
        Yeah, it's a weird feature. Using the -a parameter, it should be clear that you want to override the program's estimating process, but for some reason the -a parameter becomes a fallback value.

        • #5
          Hi again!
          So, -A worked wonders! Thank you!
          Would sorting the fastq by read ID help speed things up even more?
          Also, since I filter out reads that have too many "N" and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?
          Thanks

          • #6
            Would sorting the fastq by read ID help speed things up even more?
            As long as the nth reads in the two mate files are actually mates for all n, you're fine.

            Also, since I filter out reads that have too many "N" and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?
            This is absolutely your problem. If you filter out one read that has too many N's, you _need_ to also remove its mate.
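
A minimal sketch of pair-aware filtering, assuming the reads are already loaded as (name, sequence, quality) tuples and paired up correctly. The function name and the `max_n` threshold are hypothetical; the point is only that a pair is kept or dropped as a unit, so both output files stay the same length and in the same order.

```python
def filter_pairs_by_n(pairs, max_n=5):
    """Keep a pair only if BOTH mates have at most max_n 'N' bases.

    pairs: iterable of (read1, read2), each read a (name, seq, qual) tuple.
    max_n: hypothetical threshold, pick whatever your pipeline uses.
    """
    kept = []
    for read1, read2 in pairs:
        n1 = read1[1].upper().count("N")
        n2 = read2[1].upper().count("N")
        # Dropping one mate means dropping the other as well.
        if n1 <= max_n and n2 <= max_n:
            kept.append((read1, read2))
    return kept
```

Writing the first elements of the kept pairs to one fastq and the second elements to the other preserves the one-to-one ordering that sampe relies on.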

            BWA assumes your two mate files have an equal number of reads and that read N in one file is the mate of read N in the other file. If these files are not set up this way, it's no wonder you're getting such huge insert size estimates.

            I'd be wary of preprocessing your FASTQ files for the above reasons - additionally, BWA will not be affected by many Ns in your input reads since Ns are treated as mismatches and these reads will be quickly thrown out anyways.

            • #7
              Yeah, the only reason to remove the reads with too many Ns before aligning was for our statistics down the line. We wanted a meaningful % of unique matches. Since the fastq sometimes comes with a lot of junk, calculating the % based on the whole set of reads would artificially decrease the % of unique matches.
              We basically wanted to start "fresh" with "mappable" reads.
              This is going to be useful if we run samse on each end individually. But you are right: I will start from the raw fastq, remove my adapters, and make sure I have the same number of reads in both fastq files.
              Thank you so much for your precious help!

              • #8
                Yes - for samse, it's not really a big deal, hack up your FASTQs! The problems with FASTQ file mods arise when you're pairing, because BWA takes the reads from each file in order and assumes that the nth read in file one is the mate of the nth read in file two.
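
One way to check that assumption before running sampe is to walk both files in lockstep and compare read names. This is a rough sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines) and names that differ only by a trailing /1 or /2; both function names are made up for the example.

```python
from itertools import zip_longest


def read_names(lines):
    """Yield the read ID from each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of a record
            fields = line.split()
            if not fields:  # tolerate a trailing blank line
                continue
            name = fields[0].lstrip("@")
            # Strip the mate suffix so both files give the same ID.
            if name.endswith("/1") or name.endswith("/2"):
                name = name[:-2]
            yield name


def first_out_of_sync(lines1, lines2):
    """Return the 1-based index of the first mismatching pair, or None."""
    names = zip_longest(read_names(lines1), read_names(lines2))
    for n, (a, b) in enumerate(names, start=1):
        # A shorter file yields None on its side, which also counts.
        if a != b:
            return n
    return None
```

Both arguments can be open file handles, so the files are streamed rather than loaded into memory. A return value of None means every nth read had the same ID in both files.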

                • #9
                  Heyo,
                  I guess I am getting the same issues as Nat was on our Illumina indexed runs -

                  [bwa_sai2sam_pe_core] convert to sequence coordinate...
                  [infer_isize] (25, 50, 75) percentile: (10801, 27183, 54186)
                  [infer_isize] low and high boundaries: 76 and 140956 for estimating avg and std
                  [infer_isize] inferred external isize from 1185 pairs: 34258.846 +/- 27441.256
                  [infer_isize] skewness: 0.654; kurtosis: -0.699; ap_prior: 1.00e-05
                  [infer_isize] inferred maximum insert size: 200827 (6.07 sigma)

                  Those are some huge inserts. So since I am not doing any filtering of reads at this point - do you think my reads might be getting out-of-sync because they are demultiplexed? The only thing I am doing to the qseq files is deindexing, running qseq2fastq.pl and then running bwa aln. I'm using -A as a test right now to see if it brings things down from the 5 days or so each indexed sample was taking.
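
To spot this early instead of after days of runtime, the stderr log can be scanned for the insert size estimate. The sketch below assumes the log line looks exactly like the "inferred external isize" line quoted above (other bwa versions may format it differently), and the `expected_max` threshold is a hypothetical value to be set from what you know about the library.

```python
import re

# Pattern modelled on the "[infer_isize] inferred external isize" line above.
ISIZE_RE = re.compile(
    r"\[infer_isize\] inferred external isize from (\d+) pairs: "
    r"([\d.]+) \+/- ([\d.]+)"
)


def parse_isize(stderr_text):
    """Return (n_pairs, mean, std) from the first estimate found, or None."""
    match = ISIZE_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def looks_suspicious(mean_isize, expected_max=1000):
    """Flag an estimate far above the expected library insert size."""
    return mean_isize > expected_max
```

On the log above this yields a mean of about 34259 from only 1185 pairs, which is far outside what a standard paired-end library would give and points at files that are out of sync.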

                  • #10
                    BWA isn't name aware, so if the reads are out of parity during bwa sampe, it will try to model two partners that aren't proper pairs. These most likely aren't in the same neighborhood in the correct orientation, so it falls back to the Smith-Waterman local search, which is computationally very expensive.

                    • #11
                      Hi rskr,
                      Can you think of a reason why the partners wouldn't be pairs?

                      • #12
                        Err, rather, since they aren't and since it is Illumina's fault, would you recommend running Picard's FastqToSam to sort and then SamToFastq?

                        • #13
                          Originally posted by cbeck
                          Hi rskr,
                          Can you think of a reason why the partners wouldn't be pairs?
                          I have seen it when one read of a pair was quality filtered but the other was not. The missing read then gets replaced by whatever was next in the file, so the pairs no longer match.

                          1.1 1.2
                          2.1 2.2
                          3.1 3.2
                          4.1 5.2 <--4.2 was omitted, they are no longer in parity.
                          5.1 6.2
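
If the files have drifted like this, one possible repair is to re-pair the reads by name and keep only those present in both files. A small sketch, assuming reads are (name, sequence, quality) tuples whose names have already had any /1 or /2 suffix removed; the function name is hypothetical. It holds the second file in a dictionary, so for very large runs a sort-based tool may be the more practical route.

```python
def resync_by_name(reads1, reads2):
    """Pair reads that share a name, in the order they appear in reads1.

    Reads without a mate in the other list are dropped.
    """
    # Index the second file by read name for constant-time lookup.
    by_name = {rec[0]: rec for rec in reads2}
    paired = []
    for rec in reads1:
        mate = by_name.get(rec[0])
        if mate is not None:
            paired.append((rec, mate))
    return paired
```

Applied to the example above, reads 1, 2, 3 and 5 come back as proper pairs, while 4.1 and 6.2 are dropped because their mates are missing.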

                          • #14
                            Could you tell me how to use the -A option? It said: invalid option -- 'A'. Many thanks.

                            • #15
                              The other possibility: double-check that your bwa aln command line was right. If you accidentally put a typo in one of your fastq names and one fastq doesn't actually get aligned, sampe proceeds anyway and returns crazy large insert sizes. So try running samse on each of your individual fastqs. You want to know that they are working, and you want to know whether the two files are in sync.
