
  • Looking for best strategy to realign BAM files

    Hi all, I was given a set of BAM files (100 Gb each) that I would like to realign using bwa, and I want to do it as fast as possible.
    At the moment I start from chunked FASTQ files and send one bwa alignment to each node of my cluster, which gives good parallelism and speed.
    Starting from BAM files, I have two options:
    1- convert BAM to FASTQ -> split the FASTQ into chunks -> align in the same way
    2- feed bwa the BAM files directly
    If I go for (2) I cannot really parallelize the whole process unless I can split the BAM files into chunks that contain both mates of each fragment. The only way to do this, I guess, is to sort the BAM files by read name and then split them. I have no idea how much time that would take or how much space the newly sorted files would need.
    If I go for (1) I can use Picard SamToFastq, but it takes ~25 s per 100k reads to convert... each of my BAM files contains 130M reads, so it would take more than a week just to convert.
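
As a rough check on the conversion figures quoted above, here is a small sketch of the arithmetic. It assumes the rate stays constant at 25 s per 100k reads and that files are converted one after another; the number of files is not given in the post, so the script only reports the per-file time and how many files fit in a week at that rate.

```python
# Back-of-the-envelope estimate using only the figures quoted in the post.
SECONDS_PER_BATCH = 25         # ~25 s ...
READS_PER_BATCH = 100_000      # ... per 100k reads
READS_PER_FILE = 130_000_000   # reads in one BAM file


def conversion_hours(reads, seconds_per_batch=SECONDS_PER_BATCH,
                     reads_per_batch=READS_PER_BATCH):
    """Hours needed to convert `reads` reads at a constant rate."""
    return reads / reads_per_batch * seconds_per_batch / 3600


hours_per_file = conversion_hours(READS_PER_FILE)
files_per_week = (7 * 24) / hours_per_file

print(f"{hours_per_file:.1f} h per file")
print(f"{files_per_week:.1f} files per week if converted serially")
```

At these figures one file takes about 9 hours, so a total of more than a week corresponds to roughly 19 or more files converted serially.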

    Does anybody have two cents of advice to offer on this?
    thanks

    d

  • #2
    Is your data PE or SE?

    In fact you can split a BAM file directly, compressed or not, because it uses the BGZF (blocked gzip) format. I believe there are tools for this.

    I would prefer to feed the BAM files to BWA and run multiple BWA instances.

    my two pence.

    dong


    • #3
      Yikes, any way you do this will require reordering the records by read name for bwa to run efficiently (I was unaware that one could even feed bwa a BAM file directly). Honestly, I suspect you'd be best off using one of the parallel versions of samtools to sort the monster BAM file by read name. I hope that only one alignment is reported for each read; otherwise you'll have to take that into account if you then make a FASTQ file (maybe bwa knows how to deal with that in a BAM file).
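
One way to handle the "more than one alignment per read" concern is to keep only primary records before writing FASTQ. The sketch below works on SAM text lines and uses the secondary (0x100) and supplementary (0x800) FLAG bits from the SAM specification; the function names are hypothetical, and whether the files in question contain such records at all is not known from the thread.

```python
# SAM FLAG bits as defined in the SAM specification.
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_primary(flag: int) -> bool:
    """True if the record is neither a secondary nor a supplementary alignment."""
    return (flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY)) == 0


def primary_records(sam_lines):
    """Yield only primary alignment lines from SAM text, skipping header lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if is_primary(int(fields[1])):  # column 2 is FLAG
            yield line
```

With this filter each mate appears at most once, provided the aligner that produced the BAM marked the extra alignments with those bits.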

      If you're familiar with programming, you could split the BAM file (don't forget to put a header on each of the split files!), sort the pieces on different cluster nodes, and eventually merge them. Not being familiar with how your cluster is set up, I can't say whether it would be simpler to just use one of the parallel versions of samtools instead.
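
The splitting idea discussed in this thread (chunks that keep both mates together, each with its own header) can be sketched as follows. This is a minimal illustration, not a tested pipeline: it assumes the input is SAM text (for example from `samtools view -h`) that is already grouped by read name, and the function name and chunk size are hypothetical.

```python
from itertools import chain, groupby


def split_name_sorted_sam(lines, templates_per_chunk):
    """Yield chunks (lists of SAM lines) from name-grouped SAM text.

    Every chunk starts with a copy of the full header, and all records
    sharing a read name (both mates of a pair) stay in the same chunk.
    """
    it = iter(lines)
    header, first_record = [], None
    for line in it:
        if line.startswith("@"):
            header.append(line)
        else:
            first_record = line
            break
    if first_record is None:
        return  # header only, nothing to split

    records = chain([first_record], it)
    chunk, n_templates = list(header), 0
    # Group consecutive records by QNAME (column 1).
    for _, group in groupby(records, key=lambda rec: rec.split("\t", 1)[0]):
        chunk.extend(group)
        n_templates += 1
        if n_templates == templates_per_chunk:
            yield chunk
            chunk, n_templates = list(header), 0
    if n_templates:
        yield chunk  # last, possibly smaller, chunk
```

The input is streamed, so only one chunk is held in memory at a time; each chunk could then be written out and aligned on a separate node.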


      • #4
        Originally posted by xied75
        Is your data PE or SE?

        In fact you can split a BAM file directly, compressed or not, because it uses the BGZF (blocked gzip) format. I believe there are tools for this.

        I would prefer to feed the BAM files to BWA and run multiple BWA instances.

        my two pence.

        dong
        These are PE data. I've started a bwa realignment directly from the BAM files, using multiple bwa threads for the alignment step. The sampe step will take ages, though...
        thanks

        d
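
For reference, aligning paired-end reads straight from a BAM file with bwa's aln/sampe workflow looks roughly like the commands built below. This is a sketch that only constructs the command lines: `-b1`/`-b2` tell `bwa aln` to read the first or second mate from a BAM file (whose mates must be grouped together), `-t` sets the thread count for `aln`, and `-P` is the `sampe` option that loads the whole index into memory. The file names and the helper function are hypothetical.

```python
import shlex


def bwa_from_bam_commands(ref, bam, prefix, threads=4, preload_index=False):
    """Return (argv, stdout_path) pairs for a paired-end bwa aln/sampe run on a BAM file.

    Each command writes its result to stdout, so the second element is the
    file that stdout should be redirected to.
    """
    sai1, sai2 = f"{prefix}.1.sai", f"{prefix}.2.sai"
    aln = ["bwa", "aln", "-t", str(threads)]
    sampe = ["bwa", "sampe"]
    if preload_index:
        sampe.append("-P")  # more memory, fewer disk reads
    return [
        (aln + ["-b1", ref, bam], sai1),   # first mates
        (aln + ["-b2", ref, bam], sai2),   # second mates
        (sampe + [ref, sai1, sai2, bam, bam], f"{prefix}.sam"),
    ]


# Print the commands as shell lines (hypothetical file names).
for argv, out in bwa_from_bam_commands("ref.fa", "reads.bam", "sample", threads=8):
    print(shlex.join(argv), ">", out)
```

Only the two `aln` steps benefit from `-t`; the `sampe` step is where the single-thread bottleneck mentioned in the thread applies.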


        • #5
          Originally posted by dawe
          These are PE data. I've started a bwa realignment directly from the BAM files, using multiple bwa threads for the alignment step. The sampe step will take ages, though...
          thanks

          d
          Yes, because sampe is single-threaded no matter how powerful your machine is, and one instance will eat >6 GB of memory, so running multiple instances is also difficult unless you have something like 128 GB.

          Turning on -P will eat more memory, but it should run faster.
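
To put the memory constraint in numbers, the sketch below estimates how many single-threaded sampe instances fit on one machine. The 6 GB and 128 GB figures come from the post above; the core count is an assumption for illustration, and since 6 GB is quoted as a lower bound (and `-P` raises it by an amount not stated here), the result is an upper bound.

```python
def max_concurrent_instances(total_ram_gb, ram_per_instance_gb, cores):
    """Upper bound on concurrent single-threaded jobs, limited by RAM and cores."""
    by_memory = int(total_ram_gb // ram_per_instance_gb)
    return max(0, min(by_memory, cores))


# 128 GB machine, >6 GB per sampe instance, 32 cores (core count is assumed).
print(max_concurrent_instances(128, 6, 32))  # 21
# A 16 GB machine with 8 cores is limited by memory instead.
print(max_concurrent_instances(16, 6, 8))    # 2
```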

          My Windows version of bwa can run sampe multithreaded. I suggested that people could use my approach to modify the Linux version, but nobody seems interested in taking on the job. If I find some free time I might do it myself.

          Best,

          dong
