Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • fastest solution to get bam2fastq done

    Hi,
    I am currently using Hydra package solution but quite quickly this gets clogged as all the RAM is used (4gb). Is there any other solution that has smaller memory footprint (picard?)? What would be the optimal memory size (considering that any other tool would not offer better memory handling)?

    Thanks!

  • #2
    When you say bam2fastq are you trying to recover the original read sequences (possibly paired) as FASTQ files, in their original orientation (reverse complement undone if mapped to the reverse strand)?

    Comment


    • #3
      that is correct maubp.

      Comment


      • #4
        I think EMBOSS seqret can do BAM to FASTQ, but I'm not sure what it does with pairs or reads mapped to the reverse strand.

        I did this once myself using a custom Python script.

        Comment


        • #5
          This software worked OK on my 4GB workstation:



          Chris

          Comment


          • #6
            bampe2fqs.c using AVL trees ...

            For the peoples ...

            I've run into similar problems: bam to fastq implementations take up too much memory and take all day to run. I whipped together my own solution.
            This thing is very fast and uses very little memory (it outputs a pair when the second pair is input). It will take a little knowledge of how to point to libraries and headers to get working on your system.

            It uses AVL trees. It only works on paired end data.

            You must install samtools, tree library and possibly zlib. The tree library is at http://piumarta.com/software/tree/


            CODE HAS BEEN REDACTED FOR MORE ROBUST VERSION, SEE LATER POSTS (edit 6/21/2013)
            Last edited by Richard Finney; 06-21-2013, 10:46 AM.

            Comment


            • #7
              @Richard

              wow, this really zips thru the data! is there counterpart solution for non-paired data?

              Comment


              • #8
                Single End bam to fastq ?

                There's actually faster ways to do it, a hash table implementation I did was 5x faster, but had a fixed amount of memory. I just wanted something simple and guaranteed to work on about any system (including the 4GB nodes on biowulf at NIH) and would run fast enough that I didn't lose focus. The implementation above uses heap for unmatched pairs and so uses very little memory on a proper ("strict"?) paired end bam. Soon as the pair is matched, the node is deleted and the AVL tree is rebalanced. The pieces are there for a single end implementation, just need to cut some code out. I'll see if can do it; check back tomorrow. Everybody's welcome to do whatever they want with the code.

                Comment


                • #9
                  bamse2fq.c - single ended version

                  Here it is. This is the single ended version. Nothing special but does (re-)reverse compliment the sequence and (re-)reverses the quality for opposite strand hits. Please check your results as this is brand new code.


                  /*

                  bam to fastq SE (single end) version [ use bampe2fq for paried end data] .
                  how to compile:

                  edit line below to point to samtools libraries

                  gcc -Wall -O3 -I/h1/finneyr/samtools-0.1.18/ bamse2fqs.c -o bamse2fqs -lm /h1/finneyr/samtools-0.1.18/libbam.a /h1/finneyr//zlib-1.2.5/libz.a (AVL tree not used in this version).

                  example
                  ./bamse2fqs /tcga_next_gen/NG_buckets/bucket11/bam/TCGA-AA-3858-01A-01W-0900-09_IlluminaGA-DNASeq_exome.bam

                  outputs a fastq file called 1se.fq

                  */


                  #include <stdio.h>
                  #include <stdlib.h>
                  off_t ftello(FILE *stream); // sam.h complains unless this prototype is there for -Wall on older systems
                  #include "sam.h"
                  #include <sys/types.h>
                  #include <unistd.h>

                  static char seq[500];
                  static char qual[500];

                  static char compli(char ch)
                  {
                  if (ch == 'A') return 'T';
                  else if (ch == 'T') return 'A';
                  else if (ch == 'C') return 'G';
                  else if (ch == 'G') return 'C';
                  else if (ch == 'N') return 'N';
                  else if (ch == 'a') return 't';
                  else if (ch == 't') return 'a';
                  else if (ch == 'c') return 'g';
                  else if (ch == 'g') return 'c';
                  return '.';
                  }

                  static void reverse_and_compliment(char s[], char outs[], int len)
                  {
                  register int i,j;

                  j = 0;
                  i = len - 1;
                  while (i >= 0)
                  outs[j++] = compli(s[i--]);
                  return;
                  }

                  static void reverse(char s[], char outs[])
                  {
                  register int i,j;

                  i = strlen(s);
                  i--;

                  j = 0;
                  while (i >= 0)
                  {
                  outs[j] = s[i];
                  j++;
                  i--;
                  }
                  outs[j] = (char)0;
                  strcpy(s,outs);
                  return;
                  }


                  static FILE *fp1; // fastq output file


                  long int in_cnt = 0L;
                  long int flip_cnt = 0L;

                  static int fetch_func(const bam1_t *b, void *data)
                  {
                  uint32_t *cigar = bam1_cigar(b);
                  const bam1_core_t *c = &b->core;
                  int i, l;
                  // NO!!!!!!!!! if (b->core.tid < 0) return 0;
                  for (i = l = 0; i < c->n_cigar; ++i) {
                  int op = cigar[i]&0xf;
                  if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                  l += cigar[i]>>4;
                  }

                  uint8_t *s = bam1_seq(b);
                  uint8_t *t = bam1_qual(b);
                  qual[0] = (char)0;
                  seq[0] = (char)0;
                  for (i = 0; i < c->l_qseq; i++) seq[i] = bam_nt16_rev_table[bam1_seqi(s, i)];
                  seq[i] = (char)0;
                  i = 0;
                  for (i = 0; i < c->l_qseq; ++i) qual[i] = t[i] + 33;
                  qual[i] = (char)0;
                  if (c->flag&BAM_FREVERSE)
                  {
                  char tmps[512];

                  reverse_and_compliment(seq, tmps, strlen(seq));
                  strcpy(seq,tmps);
                  reverse(qual, tmps);
                  strcpy(qual,tmps);
                  flip_cnt++;
                  }
                  in_cnt++;

                  fprintf(fp1,"@%s\n%s\n+\n%s\n",bam1_qname(b), seq,qual); // single end, just output it

                  #if 0 // for paired end - see bampe2fq.c
                  if (c->flag&BAM_FREAD1) add(bam1_qname(b),0,seq,qual);
                  else add(bam1_qname(b),1,seq,qual);
                  #endif
                  return 0;
                  }


                  int main(int argc, char *argv[])
                  {
                  samfile_t *fp;

                  if (argc == 1) {
                  fprintf(stderr, "Usage: bamse2fq <in.bam> \n");
                  return 1;
                  }
                  if ((fp = samopen(argv[1], "rb", 0)) == 0) {
                  fprintf(stderr, "bamse2fq: Fail to open BAM file %s\n", argv[1]);
                  return 1;
                  }

                  if (argc == 2) { /* if a region is not specified */
                  fp1 = fopen("1se.fq","w");
                  bam1_t *b = bam_init1();
                  while (samread(fp, b) >= 0) fetch_func(b, fp);
                  bam_destroy1(b);
                  fclose(fp1);
                  printf("DONE in_cnt = %ld , reversed=%ld \n",in_cnt,flip_cnt);
                  } else {
                  fprintf(stderr,"ERROR: ");
                  fprintf(stderr, "Usage: bamse2fq <in.bam> \n");
                  }
                  samclose(fp);
                  return 0;
                  }
                  Last edited by Richard Finney; 12-22-2011, 06:35 PM.

                  Comment


                  • #10
                    The latest samtools actually has a hidden command: bam2fq, which converts a BAM to *single-end* FASTQ.

                    @Richard: samtools comes with a single-header hash table library khash.h. BTW, I could not open your link. I would be interested in an AVL tree implementation in C. I know a couple single-header libraries for B-tree and red-black trees.

                    Comment


                    • #11
                      AVL library

                      Heng (lh3),
                      Yes, the link is down. It was up the other day. The server at http://piumarta.com must be taking a holiday. It will probably be back. Regardless I'll email you the AVL "library" at your "me" email address (let me know if there's a better one). Tree-1.0 is really just include files, not an actual linkable library. It is MIT license. Re-implementing the data structures and operations of an AVL package would be pretty easy if it does not suit your needs.

                      AVL seems most elegant way and will avoid the pathological worst case key clashes of a hash solution; but hash will in most cases be faster with the cost of larger memory usage.

                      Comment


                      • #12
                        Hi guys,
                        I've run into a more complex problem regarding bam to fastq conversion. Somehow my bam files have some unpaired reads mixed with the paired-end reads. I was running Picard but that seems to terminate when it finds unpaired read.
                        Any suggestions on how to proceed now?

                        Thanks in advance.

                        Comment


                        • #13
                          [ EDIT 6/21/2013 ... see later posts for full, improved more robust version ]
                          If you want the "orphan" reads [ i.e. read not properly paired], modify bampe2fqs.c as instructed below ...

                          This will dump "orphan" reads to the file "orphans.fq" in addtion to the pairs in 1.fq and 2.fq

                          1) Uncomment dump_orphans() call.

                          2) Add this code to bampe2fqs.c after line with "static Tree tree;"

                          Remember AVL tree "library" is at http://piumarta.com/software/tree/ as include files and point to your libz and libbam libraries when compiling.

                          =============== insert after Tree declaration ==============
                          int orphcnt = 0; // count of orphans reads

                          void orphan_node_printer(Node *self, void *stream)
                          {
                          fprintf(stream,"@%s\n%s\n+\n%s\n",self->name, self->seq,self->qual);
                          orphcnt++;
                          }
                          void dump_orphans()
                          {
                          FILE *fpo; // orphan fastqs
                          fpo = fopen("orphan.fq","w");
                          TREE_FORWARD_APPLY(&tree, _Node, linkage, orphan_node_printer, fpo);
                          fclose(fpo);
                          fprintf(stderr,"DONE, %d orphans flushed\n",orphcnt);
                          }
                          Last edited by Richard Finney; 06-21-2013, 10:47 AM.

                          Comment


                          • #14
                            @Richard: khash takes smaller memory than most search trees on small key-value pairs. The only implementations that beat khash on memory are stx-tree and my own kbtree.h, both of which are based on B-tree/B+-tree. Nonetheless, if you store large objects (I usually store pointers instead), tree structures can be more memory efficient.

                            Comment


                            • #15
                              Hi Richard,
                              Thanks a lot for the info. Works perfectly.
                              Does this program make any assumptions about the input bam file (sorting on bam file). Does it also work if I have multiple alignments for reads (non-primary alignments)?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X