  • Help: bwa sampe extremely slow

    Hi, I've just run into this situation and am not sure whether it is normal.

    We ran 100 bp paired-end sequencing on a HiSeq and got ~90 million reads. However, mapping the reads to the human reference genome with bwa has already taken a whole day, and only ~9 million reads have been processed by the bwa sampe command, whose output is piped to samtools view to convert SAM to BAM.

    I checked the log files and everything seemed normal: it keeps reporting progress and the inferred insert size (isize). But it seems far too slow, and I have no idea why. Is this normal?

    Any advice will be highly appreciated. Thanks!

  • #2
    What insert sizes does sampe report that it sees?

    You could try running samse on each end separately and eyeballing the SAM files to see whether the pairs look right.
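A minimal sketch of that check, assuming hypothetical file names (`ref.fa` as the bwa index prefix, `r1.sai`/`r2.sai` from `bwa aln`, `r1.fq`/`r2.fq` as the reads). It maps one end on its own with `bwa samse` and prints the first four SAM columns of the first few records, so the two ends can be compared by eye:

```shell
# Map one end as single-end reads and show the first few alignments.
# Arguments (all hypothetical names): index prefix, .sai file, FASTQ file.
peek_samse() {
  bwa samse "$1" "$2" "$3" |
    grep -v '^@' |   # skip SAM header lines
    cut -f1-4 |      # read name, flag, reference name, position
    head -n 20
}

# Usage (not run here):
#   peek_samse ref.fa r1.sai r1.fq
#   peek_samse ref.fa r2.sai r2.fq
```

If the ends are properly paired, the same read names should show up in the same order in both outputs, mostly on the same reference and a few hundred bases apart.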



    • #3
      The insert size seems all right, just around 300-400 bp, but it is really slow. Do you know if there is anything wrong with it?
      Thanks so much~

      Here is an example of the output:
      [infer_isize] (25, 50, 75) percentile: (313, 337, 363)
      [infer_isize] low and high boundaries: 213 and 463 for estimating avg and std
      [infer_isize] inferred external isize from 214684 pairs: 337.152 +/- 40.108
      [infer_isize] skewness: -0.131; kurtosis: 0.117; ap_prior: 2.61e-05
      [infer_isize] inferred maximum insert size: 616 (6.94 sigma)
      [bwa_sai2sam_pe_core] time elapses: 19.58 sec
      [bwa_sai2sam_pe_core] changing coordinates of 7276 alignments.
      [bwa_sai2sam_pe_core] align unmapped mate...
      [bwa_paired_sw] 7091 out of 7430 Q17 singletons are mated.
      [bwa_paired_sw] 1969 out of 3915 Q17 discordant pairs are fixed.
      [bwa_sai2sam_pe_core] time elapses: 5.56 sec
      [bwa_sai2sam_pe_core] refine gapped alignments... 0.82 sec
      [bwa_sai2sam_pe_core] print alignments... 2.48 sec
      [bwa_sai2sam_pe_core] 8912896 sequences have been processed.
      [bwa_read_seq] 1.0% bases are trimmed.
      [bwa_read_seq] 1.6% bases are trimmed.
      [bwa_sai2sam_pe_core] convert to sequence coordinate...
      Last edited by caswater; 04-06-2012, 11:09 PM.
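A rough back-of-the-envelope check on the log above, written as a small shell sketch. It assumes the four step times in this excerpt are typical of every batch, and that bwa processes reads in batches of 262144 (an assumption, chosen because 8912896 / 262144 is exactly 34); neither is stated in the thread.

```shell
# Per-batch step times copied from the log excerpt (seconds):
# pairing, unmapped-mate rescue, gapped refinement, printing.
batch_compute_secs() {
  echo "19.58 5.56 0.82 2.48" | awk '{ printf "%.2f\n", $1 + $2 + $3 + $4 }'
}

# ASSUMPTION: 262144 sequences per batch (not stated in the thread).
batches_done() {
  echo $(( 8912896 / 262144 ))
}

# Total logged step time for the batches processed so far, in minutes.
logged_minutes() {
  awk -v per="$(batch_compute_secs)" -v n="$(batches_done)" \
    'BEGIN { printf "%.1f\n", per * n / 60 }'
}

batch_compute_secs   # 28.44
batches_done         # 34
logged_minutes       # 16.1
```

If those assumptions hold, the logged steps add up to roughly 16 minutes for the reads processed so far. That is a small fraction of the day the job has been running, which suggests most of the wall-clock time is going somewhere the log does not report.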



      • #4
        Can anyone give any advice on this? Thanks a lot!



        • #5
          Not sure if this helps but sampe has always been a slow step for us. We are working on a bacterial genome (only ~1.3Mb in size) and have ~80M*2 PE reads (insert size=~365bp). It takes ~4,000 sec CPU time to run the aln step for each end but we can cut this down to a couple of minutes with multi-threading. The sampe step takes ~5,100 sec and there's not much we can do to reduce this.

          One possibility is to lower the '-o' option (the maximum number of occurrences of a read for pairing), so that reads falling in repeats are treated as single-end reads instead of being paired. I guess this would probably help for human or plant genomes. Bacterial genomes have too few repeats in comparison, so we didn't bother to change the default.
          Last edited by chkuo; 05-03-2012, 01:36 AM.
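A minimal sketch of the '-o' suggestion, assuming hypothetical file names and an arbitrary illustrative limit. The bwa manual lists 100000 as the default for `-o` and notes that reducing it makes pairing faster; a suitable value for a given data set is not given in the thread.

```shell
# Pair reads, treating any read with more than max_occ hits as single-end.
# First argument: value for -o; remaining arguments are passed to bwa sampe
# unchanged (index prefix, two .sai files, two FASTQ files).
sampe_limit_occ() {
  max_occ="$1"
  shift
  bwa sampe -o "$max_occ" "$@"
}

# Usage (not run here; 1000 is an arbitrary illustration, not a recommendation):
#   sampe_limit_occ 1000 ref.fa r1.sai r2.sai r1.fq r2.fq > out.sam
```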



          • #6
            Dear all,

            As you may have observed, SAMPE has two bottlenecks: (1) it is single-threaded, and (2) it is I/O bound.

            If your I/O subsystem (i.e. your disks) is not very fast, please use the -P switch, which stops it from loading the index files again and again. (Did you know that, for each batch of reads, it reads the .bwt and .sa files into memory, uses them, discards them, then loads the .pac file, uses it, discards it, and does the same again for the next batch? The batch in the log above contributed 214684 pairs to the insert-size estimate.)

            With -P, these files stay in memory and you will see a roughly constant memory footprint over the whole run. Note that it uses more memory than running without it; you had better have 8 GB for the human reference.

            In my recent port of BWA to Windows, I added a -t switch to SAMPE so that you can do multithreading, but I guess you guys don't use Windows.

            Best,

            dong
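A minimal sketch of a sampe run with `-P`, piped to samtools as in the original question. All file names are hypothetical, and `samtools view -bS -` is the SAM-to-BAM form used by samtools releases of that period.

```shell
# Run sampe with the index preloaded into memory (-P) and convert SAM to BAM
# on the fly. Arguments (hypothetical names): index prefix, two .sai files,
# two FASTQ files, output BAM path.
sampe_preload_to_bam() {
  bwa sampe -P "$1" "$2" "$3" "$4" "$5" |
    samtools view -bS - > "$6"
}

# Usage (not run here):
#   sampe_preload_to_bam ref.fa r1.sai r2.sai r1.fq r2.fq out.bam
```

Watching the process with `top` during the run should show the roughly constant memory footprint described above.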



            • #7
              Forgot to mention that we do use the '-P' option. Any chance of multi-threading sampe in the Linux version soon?



              • #8
                Originally posted by caswater
                the insert size seems alright, just around 300-400 bp... but it is really slow.. do ya know if there is anything wrong with it?
                [quoted log output, identical to post #3, omitted]
                Your elapsed times are similar to mine but maybe a little slower. What type of computer are you using?
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */



                • #9
                  How about you go to the 1000 Genomes site, download a BAM, run on that, and give us some numbers? Then we can do the same and compare.



                  • #10
                    Thanks a lot... using -P does indeed substantially reduce the computation time. Thanks for all your suggestions!



                    • #11
                      You could also try running sampe with the -s switch to disable Smith-Waterman alignment for unmapped mates. Obviously, it depends on the sensitivity you want and on your genome of interest, but that should speed it up as well...



                      • #12
                        You'll need to rule out the obvious problems first.

                        Is your data on a slow mounted drive?
                        Are other people running many jobs on your machine?
                        What machine type are you running? How many CPUs?

                        Run this command from the command line:

                        grep bogomips /proc/cpuinfo

                        "bogomips" is a rough measure of CPU speed, so treat it only as a crude indicator.
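The other questions above can be answered from the shell in a similar way. This is a minimal sketch for Linux only (it reads `/proc`); the directory argument is a hypothetical location for the reads and index, defaulting to the current directory.

```shell
# Quick look at the machine: CPU count, load, and the filesystem holding the data.
machine_summary() {
  data_dir="${1:-.}"   # directory with reads and index (hypothetical default)
  echo "cpus: $(grep -c '^processor' /proc/cpuinfo)"
  echo "load: $(cut -d' ' -f1-3 /proc/loadavg)"   # 1, 5 and 15 minute load averages
  df -h "$data_dir" | tail -n 1   # device and mount point for the data
}

machine_summary .
```

A load average well above the CPU count suggests other jobs are competing for the machine, and a network device in the last line suggests the data sits on a slow mounted drive.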
