  • help~~bwa sampe extremely slow~!!

    Hi, I just ran into this situation and am not sure whether it is normal.

    We ran 100 bp paired-end sequencing on a HiSeq, which generated ~90 million reads. Mapping the reads onto the human reference genome with bwa has already taken one whole day, and only ~9 million reads have been processed by the bwa sampe command, whose output is piped to samtools view to convert SAM to BAM.

    I checked the log files and everything seemed normal; it keeps reporting the progress and also the isize... but it seems way too slow, and I have no idea why. Is this normal?

    Any advice will be highly appreciated. Thanks!

  • #2
    What insert sizes does sampe report that it sees?

    You could try running samse on each end and eyeballing the SAM files to see if the pairs look right.

    • #3
      The insert size seems alright, just around 300-400 bp... but it is really slow. Do you know if there is anything wrong with it?
      Thanks so much~

      Here is an example of the output:
      [infer_isize] (25, 50, 75) percentile: (313, 337, 363)
      [infer_isize] low and high boundaries: 213 and 463 for estimating avg and std
      [infer_isize] inferred external isize from 214684 pairs: 337.152 +/- 40.108
      [infer_isize] skewness: -0.131; kurtosis: 0.117; ap_prior: 2.61e-05
      [infer_isize] inferred maximum insert size: 616 (6.94 sigma)
      [bwa_sai2sam_pe_core] time elapses: 19.58 sec
      [bwa_sai2sam_pe_core] changing coordinates of 7276 alignments.
      [bwa_sai2sam_pe_core] align unmapped mate...
      [bwa_paired_sw] 7091 out of 7430 Q17 singletons are mated.
      [bwa_paired_sw] 1969 out of 3915 Q17 discordant pairs are fixed.
      [bwa_sai2sam_pe_core] time elapses: 5.56 sec
      [bwa_sai2sam_pe_core] refine gapped alignments... 0.82 sec
      [bwa_sai2sam_pe_core] print alignments... 2.48 sec
      [bwa_sai2sam_pe_core] 8912896 sequences have been processed.
      [bwa_read_seq] 1.0% bases are trimmed.
      [bwa_read_seq] 1.6% bases are trimmed.
      [bwa_sai2sam_pe_core] convert to sequence coordinate...
      Last edited by caswater; 04-06-2012, 11:09 PM.
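
For anyone reading the log above: the timed steps of one batch can be added up with a few lines of shell. This is only a sketch. It assumes every timed step is printed on a line ending in "sec", as in the excerpt, and the function name batch_seconds is made up for illustration.

```shell
# Sum the per-step timings that sampe prints for one batch.
# Reads log text on stdin and adds up the number before a trailing "sec".
batch_seconds() {
    awk '$NF == "sec" { total += $(NF - 1) } END { printf "%.2f\n", total }'
}

# Timed lines (plus one untimed line) copied from the log excerpt in this thread.
batch_seconds <<'EOF'
[bwa_sai2sam_pe_core] time elapses: 19.58 sec
[bwa_sai2sam_pe_core] changing coordinates of 7276 alignments.
[bwa_sai2sam_pe_core] time elapses: 5.56 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.82 sec
[bwa_sai2sam_pe_core] print alignments... 2.48 sec
EOF
# prints 28.44
```

That is under half a minute of reported time for this batch. If these timers count CPU time rather than wall-clock time (an assumption worth checking against the bwa source), time spent waiting on disk would not show up in them.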

      • #4
        Can anyone give any advice on this? Thanks a lot!!

        • #5
          Not sure if this helps but sampe has always been a slow step for us. We are working on a bacterial genome (only ~1.3Mb in size) and have ~80M*2 PE reads (insert size=~365bp). It takes ~4,000 sec CPU time to run the aln step for each end but we can cut this down to a couple of minutes with multi-threading. The sampe step takes ~5,100 sec and there's not much we can do to reduce this.

          One possibility is to lower the '-o' option (the maximum number of occurrences of a read that sampe will consider for pairing) so that reads involved in repeats are no longer paired. I guess this probably would help in the case of human/plant genomes. There are simply too few repeats in bacterial genomes in comparison, so we didn't bother to change the default.
          Last edited by chkuo; 05-03-2012, 01:36 AM.
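
A rough sketch of the two points in this post: aln accepts -t for threads, and sampe accepts -o to cap the occurrences of a read considered for pairing (per the bwa manual, a read with more occurrences is treated as single-end, and the default is 100000). The function name, the file names, and the values 8 and 1000 are placeholders for illustration, not recommendations.

```shell
# Align both ends with multi-threaded aln, then pair with a lowered -o cap.
# Arguments: reference prefix, FASTQ for end 1, FASTQ for end 2, threads, -o value.
# Assumes the reference was already indexed with "bwa index".
align_pe_capped() {
    ref=$1; fq1=$2; fq2=$3; threads=$4; max_occ=$5
    # aln is multi-threaded via -t; run it once per end
    bwa aln -t "$threads" "$ref" "$fq1" > "$fq1.sai"
    bwa aln -t "$threads" "$ref" "$fq2" > "$fq2.sai"
    # sampe itself is single-threaded; SAM goes to stdout
    bwa sampe -o "$max_occ" "$ref" "$fq1.sai" "$fq2.sai" "$fq1" "$fq2"
}

# Example call (hypothetical files, not run here):
# align_pe_capped ref.fa reads_1.fq reads_2.fq 8 1000 > out.sam
```

Whether a lower cap is acceptable depends on how much you care about pairing reads that fall in repeats.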

          • #6
            Dear all,

            As you may have observed, sampe has two bottlenecks: (1) it is single-threaded; (2) it is I/O bound.

            If your I/O subsystem (i.e. your disks) is not very fast, please use the -P switch, so that it stops loading the index files again and again. (Did you know that for each batch of 214684 reads, it reads the .BWT and .SA files into memory, uses them, dumps them, then loads the .PAC, uses it, dumps it, and does all that again for the next batch?)

            With -P, these files stay in memory and you see a roughly constant memory footprint over the whole run (note: it does use more memory than without it; you had better have 8 GB for a human reference).

            In my recent port of BWA to Windows, I added a -t switch to sampe so that you can do multithreading, but I guess you guys don't use Windows.

            Best,

            dong
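
As a concrete sketch of the -P suggestion, combined with the samtools pipe described in the first post. The function name and file names are hypothetical; -b and -S are the samtools view options for BAM output and SAM input.

```shell
# Run sampe with the index preloaded (-P) and convert SAM to BAM on the fly.
# Arguments: reference prefix, .sai for end 1, .sai for end 2,
#            FASTQ for end 1, FASTQ for end 2, output BAM path.
sampe_preloaded_to_bam() {
    ref=$1; sai1=$2; sai2=$3; fq1=$4; fq2=$5; out=$6
    bwa sampe -P "$ref" "$sai1" "$sai2" "$fq1" "$fq2" \
        | samtools view -bS - > "$out"
}

# Example call (hypothetical files, not run here):
# sampe_preloaded_to_bam hg19.fa r1.sai r2.sai r1.fq r2.fq out.bam
```

Watching the process in top while it runs is a quick way to confirm that the memory footprint stays flat, as described above.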

            • #7
              Forgot to mention that we do use the '-P' option. Any chance of multi-threading sampe in the Linux version soon?

              • #8
                Originally posted by caswater
                the insert size seems alright, just around 300-400 bp... but it is really slow..
                Your elapsed times are similar to mine but maybe a little slower. What type of computer are you using?
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                • #9
                  How about you go to the 1000 Genomes site, download a BAM, run on that, and give us some numbers? Then we can do the same and compare.

                  • #10
                    Thanks a lot... using -P does indeed substantially reduce the computation time. Thanks for all your suggestions!

                    • #11
                      You could also try running sampe with the -s switch to disable Smith-Waterman alignment for an unmapped mate. Obviously, it depends on the sensitivity you want and on your genome of interest, but that should speed it up as well...
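
To get a feel for what -s could save, one can compare the time of the "align unmapped mate" step with the other timed steps in the log posted in #3 (5.56 s against 19.58, 0.82 and 2.48 s). A small sketch, assuming the 5.56 s timer covers exactly the step that -s turns off; the function name is made up for illustration.

```shell
# Percentage of the reported per-batch time spent in one step.
# First argument: seconds in the step of interest; remaining arguments: the other timed steps.
sw_share_percent() {
    sw=$1; shift
    echo "$sw $*" | awk '{ total = 0
                           for (i = 1; i <= NF; i++) total += $i
                           printf "%.0f\n", 100 * $1 / total }'
}

# Numbers copied from the log excerpt in post #3.
sw_share_percent 5.56 19.58 0.82 2.48
# prints 20
```

So roughly a fifth of the reported time for that batch, in exchange for giving up the mate rescue reported in the [bwa_paired_sw] lines of the same log.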

                      • #12
                        You'll need to rule out the obvious problems first.

                        Is your data on a slow mounted drive?
                        Are other people running many jobs on your machine?
                        What machine type are you running? How many CPUs?

                        Run this command from the command line:

                        grep bogomips /proc/cpuinfo

                        "bogomips" is a rough measure of CPU speed.

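The other questions in this post can be answered from the shell too. A small sketch for a Linux machine, using only standard tools (df, nproc and /proc/loadavg):

```shell
# Filesystem holding the current directory (a network mount here can be slow)
df -h .

# 1-, 5- and 15-minute load averages; values near or above the CPU count
# suggest that other jobs are competing for the machine
cat /proc/loadavg

# Number of processing units available to this process
nproc
```

Run it from the directory that holds the reads and the index, since that is the filesystem sampe will be reading from.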