Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hi Richard,
    Please have a look at my previous question. I was wondering what happens to those reads whose second is pair is unmapped and also missing from the bam file?
    Thanks.

    Comment


    • #17
      Sorry for a late response, vinay.
      Those reads with no pair in the bam are "orphan reads". See the 05-10-2012 post for outputting orphan reads. "Good" bams should have all pairs whether mapped or not, though sometimes programs drop unmapped reads to save space.

      Comment


      • #18
        Hi,

        I just downloaded the bam2fastq file but I cannot extract it. Is there any problem with the file? Thanks

        --

        tar -xzf bam2fastq-1.1.0.tgz

        gzip: stdin: unexpected end of file
        tar: Child returned status 1
        tar: Error exit delayed from previous errors

        Comment


        • #19
          Hi,

          It worked for me.


          % tar -xvzf bam2fastq-1.1.0.tgz
          bam2fastq-1.1.0/
          bam2fastq-1.1.0/bam2fastq.cpp
          bam2fastq-1.1.0/HISTORY.txt
          bam2fastq-1.1.0/LICENSE
          bam2fastq-1.1.0/Makefile
          bam2fastq-1.1.0/README.txt
          bam2fastq-1.1.0/samtools/
          bam2fastq-1.1.0/samtools/._AUTHORS
          bam2fastq-1.1.0/samtools/AUTHORS
          bam2fastq-1.1.0/samtools/._bam.c
          bam2fastq-1.1.0/samtools/bam.c
          bam2fastq-1.1.0/samtools/._bam.h
          bam2fastq-1.1.0/samtools/bam.h
          bam2fastq-1.1.0/samtools/._bam_aux.c
          bam2fastq-1.1.0/samtools/bam_aux.c
          bam2fastq-1.1.0/samtools/._bam_color.c
          bam2fastq-1.1.0/samtools/bam_color.c
          bam2fastq-1.1.0/samtools/._bam_endian.h
          bam2fastq-1.1.0/samtools/bam_endian.h
          bam2fastq-1.1.0/samtools/._bam_import.c
          bam2fastq-1.1.0/samtools/bam_import.c
          bam2fastq-1.1.0/samtools/._bam_index.c
          bam2fastq-1.1.0/samtools/bam_index.c
          bam2fastq-1.1.0/samtools/._bam_lpileup.c
          bam2fastq-1.1.0/samtools/bam_lpileup.c
          bam2fastq-1.1.0/samtools/._bam_maqcns.c
          bam2fastq-1.1.0/samtools/bam_maqcns.c
          bam2fastq-1.1.0/samtools/._bam_maqcns.h
          bam2fastq-1.1.0/samtools/bam_maqcns.h
          bam2fastq-1.1.0/samtools/._bam_mate.c
          bam2fastq-1.1.0/samtools/bam_mate.c
          bam2fastq-1.1.0/samtools/._bam_md.c
          bam2fastq-1.1.0/samtools/bam_md.c
          bam2fastq-1.1.0/samtools/._bam_pileup.c
          bam2fastq-1.1.0/samtools/bam_pileup.c
          bam2fastq-1.1.0/samtools/._bam_plcmd.c
          bam2fastq-1.1.0/samtools/bam_plcmd.c
          bam2fastq-1.1.0/samtools/._bam_reheader.c
          bam2fastq-1.1.0/samtools/bam_reheader.c
          bam2fastq-1.1.0/samtools/._bam_rmdup.c
          bam2fastq-1.1.0/samtools/bam_rmdup.c
          bam2fastq-1.1.0/samtools/._bam_rmdupse.c
          bam2fastq-1.1.0/samtools/bam_rmdupse.c
          bam2fastq-1.1.0/samtools/._bam_sort.c
          bam2fastq-1.1.0/samtools/bam_sort.c
          bam2fastq-1.1.0/samtools/._bam_stat.c
          bam2fastq-1.1.0/samtools/bam_stat.c
          bam2fastq-1.1.0/samtools/._bam_tview.c
          bam2fastq-1.1.0/samtools/bam_tview.c
          bam2fastq-1.1.0/samtools/._bamtk.c
          bam2fastq-1.1.0/samtools/bamtk.c
          bam2fastq-1.1.0/samtools/._bgzf.c
          bam2fastq-1.1.0/samtools/bgzf.c
          bam2fastq-1.1.0/samtools/._bgzf.h
          bam2fastq-1.1.0/samtools/bgzf.h
          bam2fastq-1.1.0/samtools/._bgzip.c
          bam2fastq-1.1.0/samtools/bgzip.c
          bam2fastq-1.1.0/samtools/._ChangeLog
          bam2fastq-1.1.0/samtools/ChangeLog
          bam2fastq-1.1.0/samtools/COPYING
          bam2fastq-1.1.0/samtools/._examples
          bam2fastq-1.1.0/samtools/examples/
          bam2fastq-1.1.0/samtools/._faidx.c
          bam2fastq-1.1.0/samtools/faidx.c
          bam2fastq-1.1.0/samtools/._faidx.h
          bam2fastq-1.1.0/samtools/faidx.h
          bam2fastq-1.1.0/samtools/._glf.c
          bam2fastq-1.1.0/samtools/glf.c
          bam2fastq-1.1.0/samtools/._glf.h
          bam2fastq-1.1.0/samtools/glf.h
          bam2fastq-1.1.0/samtools/._INSTALL
          bam2fastq-1.1.0/samtools/INSTALL
          bam2fastq-1.1.0/samtools/._kaln.c
          bam2fastq-1.1.0/samtools/kaln.c
          bam2fastq-1.1.0/samtools/._kaln.h
          bam2fastq-1.1.0/samtools/kaln.h
          bam2fastq-1.1.0/samtools/._khash.h
          bam2fastq-1.1.0/samtools/khash.h
          bam2fastq-1.1.0/samtools/._klist.h
          bam2fastq-1.1.0/samtools/klist.h
          bam2fastq-1.1.0/samtools/._knetfile.c
          bam2fastq-1.1.0/samtools/knetfile.c
          bam2fastq-1.1.0/samtools/._knetfile.h
          bam2fastq-1.1.0/samtools/knetfile.h
          bam2fastq-1.1.0/samtools/._kseq.h
          bam2fastq-1.1.0/samtools/kseq.h
          bam2fastq-1.1.0/samtools/._ksort.h
          bam2fastq-1.1.0/samtools/ksort.h
          bam2fastq-1.1.0/samtools/._kstring.c
          bam2fastq-1.1.0/samtools/kstring.c
          bam2fastq-1.1.0/samtools/._kstring.h
          bam2fastq-1.1.0/samtools/kstring.h
          bam2fastq-1.1.0/samtools/._Makefile
          bam2fastq-1.1.0/samtools/Makefile
          bam2fastq-1.1.0/samtools/._Makefile.mingw
          bam2fastq-1.1.0/samtools/Makefile.mingw
          bam2fastq-1.1.0/samtools/._misc
          bam2fastq-1.1.0/samtools/misc/
          bam2fastq-1.1.0/samtools/NEWS
          bam2fastq-1.1.0/samtools/._razf.c
          bam2fastq-1.1.0/samtools/razf.c
          bam2fastq-1.1.0/samtools/._razf.h
          bam2fastq-1.1.0/samtools/razf.h
          bam2fastq-1.1.0/samtools/._razip.c
          bam2fastq-1.1.0/samtools/razip.c
          bam2fastq-1.1.0/samtools/._sam.c
          bam2fastq-1.1.0/samtools/sam.c
          bam2fastq-1.1.0/samtools/._sam.h
          bam2fastq-1.1.0/samtools/sam.h
          bam2fastq-1.1.0/samtools/._sam_header.c
          bam2fastq-1.1.0/samtools/sam_header.c
          bam2fastq-1.1.0/samtools/._sam_header.h
          bam2fastq-1.1.0/samtools/sam_header.h
          bam2fastq-1.1.0/samtools/._sam_view.c
          bam2fastq-1.1.0/samtools/sam_view.c
          bam2fastq-1.1.0/samtools/._samtools.1
          bam2fastq-1.1.0/samtools/samtools.1
          bam2fastq-1.1.0/samtools/samtools.txt
          bam2fastq-1.1.0/samtools/._win32
          bam2fastq-1.1.0/samtools/win32/
          bam2fastq-1.1.0/samtools/win32/._xcurses.h
          bam2fastq-1.1.0/samtools/win32/xcurses.h
          bam2fastq-1.1.0/samtools/win32/._zconf.h
          bam2fastq-1.1.0/samtools/win32/zconf.h
          bam2fastq-1.1.0/samtools/win32/._zlib.h
          bam2fastq-1.1.0/samtools/win32/zlib.h
          bam2fastq-1.1.0/samtools/misc/._blast2sam.pl
          bam2fastq-1.1.0/samtools/misc/blast2sam.pl
          bam2fastq-1.1.0/samtools/misc/._bowtie2sam.pl
          bam2fastq-1.1.0/samtools/misc/bowtie2sam.pl
          bam2fastq-1.1.0/samtools/misc/._export2sam.pl
          bam2fastq-1.1.0/samtools/misc/export2sam.pl
          bam2fastq-1.1.0/samtools/misc/._interpolate_sam.pl
          bam2fastq-1.1.0/samtools/misc/interpolate_sam.pl
          bam2fastq-1.1.0/samtools/misc/._Makefile
          bam2fastq-1.1.0/samtools/misc/Makefile
          bam2fastq-1.1.0/samtools/misc/._maq2sam.c
          bam2fastq-1.1.0/samtools/misc/maq2sam.c
          bam2fastq-1.1.0/samtools/misc/._md5.c
          bam2fastq-1.1.0/samtools/misc/md5.c
          bam2fastq-1.1.0/samtools/misc/._md5.h
          bam2fastq-1.1.0/samtools/misc/md5.h
          bam2fastq-1.1.0/samtools/misc/._md5fa.c
          bam2fastq-1.1.0/samtools/misc/md5fa.c
          bam2fastq-1.1.0/samtools/misc/._novo2sam.pl
          bam2fastq-1.1.0/samtools/misc/novo2sam.pl
          bam2fastq-1.1.0/samtools/misc/._psl2sam.pl
          bam2fastq-1.1.0/samtools/misc/psl2sam.pl
          bam2fastq-1.1.0/samtools/misc/._sam2vcf.pl
          bam2fastq-1.1.0/samtools/misc/sam2vcf.pl
          bam2fastq-1.1.0/samtools/misc/._samtools.pl
          bam2fastq-1.1.0/samtools/misc/samtools.pl
          bam2fastq-1.1.0/samtools/misc/._soap2sam.pl
          bam2fastq-1.1.0/samtools/misc/soap2sam.pl
          bam2fastq-1.1.0/samtools/misc/._varfilter.py
          bam2fastq-1.1.0/samtools/misc/varfilter.py
          bam2fastq-1.1.0/samtools/misc/._wgsim.c
          bam2fastq-1.1.0/samtools/misc/wgsim.c
          bam2fastq-1.1.0/samtools/misc/._wgsim_eval.pl
          bam2fastq-1.1.0/samtools/misc/wgsim_eval.pl
          bam2fastq-1.1.0/samtools/misc/._zoom2sam.pl
          bam2fastq-1.1.0/samtools/misc/zoom2sam.pl
          bam2fastq-1.1.0/samtools/examples/._00README.txt
          bam2fastq-1.1.0/samtools/examples/00README.txt
          bam2fastq-1.1.0/samtools/examples/._bam2bed.c
          bam2fastq-1.1.0/samtools/examples/bam2bed.c
          bam2fastq-1.1.0/samtools/examples/._calDepth.c
          bam2fastq-1.1.0/samtools/examples/calDepth.c
          bam2fastq-1.1.0/samtools/examples/._ex1.fa
          bam2fastq-1.1.0/samtools/examples/ex1.fa
          bam2fastq-1.1.0/samtools/examples/._ex1.sam.gz
          bam2fastq-1.1.0/samtools/examples/ex1.sam.gz
          bam2fastq-1.1.0/samtools/examples/._Makefile
          bam2fastq-1.1.0/samtools/examples/Makefile
          bam2fastq-1.1.0/samtools/examples/._toy.fa
          bam2fastq-1.1.0/samtools/examples/toy.fa
          bam2fastq-1.1.0/samtools/examples/._toy.sam
          bam2fastq-1.1.0/samtools/examples/toy.sam

          Comment


          • #20
            Problem solved!

            The download file at the webpage isn't working. I downloaded again from the "Version History" 1.1.0, is working. Maybe the admin didn't realize it.

            Comment


            • #21
              Have you tried bamtools? I believe the command is:

              bamtools convert -in file1.bam -in file2.bam ... -format fastq >reads.fq

              Comment


              • #22
                how to modify 1.fq to write bamfilename_1.fq?

                Hi Richard,

                How can I modify "1.fq" and "2.fq" to write instead bamfilename_1.fq and bamfilename_2.fq as I have many bam files and I want to preserve the name of the file where fastq files are generated from.

                Sorry I don't know C at all but I find your code very helpful and it runs smoothly.

                Thanks so much.

                Originally posted by Richard Finney View Post
                For the peoples ...

                I've run into similar problems: bam to fastq implementations take up too much memory and take all day to run. I whipped together my own solution.
                This thing is very fast and uses very little memory (it outputs a pair when the second pair is input). It will take a little knowledge of how to point to libraries and headers to get working on your system.

                It uses AVL trees. It only works on paired end data.

                You must install samtools, tree library and possibly zlib. The tree library is at http://piumarta.com/software/tree/


                /*
                how to compile:
                edit line below to point to samtools and tree-1.0 for headers and libraries

                gcc -Wall -O2 -I/h1/finneyr/samtools-0.1.18/ -I./tree-1.0 bampe2fqs.c -o bampe2fqs -lm /h1/finneyr/samtools-0.1.18/libbam.a /h1/finneyr//zlib-1.2.5/libz.a

                example
                ./bampe2fqs /tcga_next_gen/NG_buckets/bucket11/bam/TCGA-AA-3858-01A-01W-0900-09_IlluminaGA-DNASeq_exome.bam

                outputs 2 files : 1.fq and 2.fq which are the paired fastqs

                AVL tree "library" is at http://piumarta.com/software/tree/ (it's not really a library, it's "include" files only)

                */

                #include <stdio.h>
                #include <stdlib.h>
                off_t ftello(FILE *stream); // prototype cuz sam.h complains with -Wall on old systems

                #include "sam.h"
                #include <sys/types.h>
                #include <unistd.h>
                #include <tree.h>


                static char compli(char ch)
                {
                if (ch == 'A') return 'T';
                else if (ch == 'T') return 'A';
                else if (ch == 'C') return 'G';
                else if (ch == 'G') return 'C';
                else if (ch == 'N') return 'N';
                else if (ch == 'a') return 't';
                else if (ch == 't') return 'a';
                else if (ch == 'c') return 'g';
                else if (ch == 'g') return 'c';
                return '.';
                }

                static void reverse_and_compliment(char s[], char outs[], int len)
                {
                register int i,j;

                j = 0;
                i = len - 1;
                while (i >= 0)
                outs[j++] = compli(s[i--]);
                return;
                }

                static void reverse(char s[], char outs[])
                {
                register int i,j;

                i = strlen(s);
                i--;

                j = 0;
                while (i >= 0)
                {
                outs[j] = s[i];
                j++;
                i--;
                }
                outs[j] = (char)0;
                strcpy(s,outs);
                return;
                }


                static int pid;
                static FILE *fp1; // first mate pair fastq output file
                static FILE *fp2; // second mate pair fastq output file

                struct data_type
                {
                char *name;
                int flag;
                char *seq;
                char *qual;
                };

                static long int incnt = 0;
                static long int outcnt = 0;
                static long int live = 0;

                static void info(char *name)
                {
                fprintf(stderr,"note: output %ld in=%ld live=%ld fq pairs , %s \n",outcnt,incnt,live,name);
                fflush(stderr);

                #if 1
                // how much memory do we use ?
                char m[512];
                sprintf(m,"pmap %d | grep total",pid);
                system(m);
                fflush(stdout);
                #endif
                }

                #if 0 // left to the reader as an exercise, shouldn't need this on proper bams
                static FILE *fpo; // orphan fastqs

                void dump_orphans()
                {
                fpo = fopen("orphan.fq","w");
                fclose(fpo);
                fprintf(stderr,"DONE, %d unflushed\n",cnt);
                }
                #endif


                typedef struct _Node
                {
                char *name;
                char *seq;
                char *qual;
                int flag;
                TREE_ENTRY(_Node) linkage;
                } Node;

                typedef TREE_HEAD(_Tree, _Node) Tree;

                TREE_DEFINE(_Node, linkage);

                int fence = 0;

                char seq[500];
                char qual[500];
                int flag;

                Node *Node_new(char *name)
                {
                Node *self = (Node *)calloc(1, sizeof(Node));
                self->name = strdup(name);
                self->seq = strdup(seq);
                self->qual = strdup(qual);
                self->flag = flag;
                return self;
                }

                int Node_compare(Node *lhs, Node *rhs)
                {
                return strcmp(lhs->name, rhs->name);
                }

                void Node_printer(Node *self, void *stream)
                {
                if (fence)
                {
                fprintf((FILE *)stream, "%s", self->name);
                --fence;
                }
                }


                int firsttime = 1;
                static Tree tree;

                void static add(char *name,int flag_arg, char *seqarg, char *qualarg)
                {
                char *z1,*z2,*z3;
                flag = flag_arg;

                if (firsttime == 1)
                {
                static Tree treep = TREE_INITIALIZER(Node_compare);
                tree = treep;
                firsttime = 0;
                pid = getpid();
                }
                incnt++;

                if ((incnt % 10000000) == 0) info(name);
                Node test = { name };
                Node *ptr = TREE_FIND(&tree, _Node, linkage, &test);
                if (ptr)
                {
                outcnt++;
                if (flag_arg == 1)
                {
                fprintf(fp1,"@%s/1\n%s\n+\n%s\n",ptr->name, ptr->seq,ptr->qual);
                fprintf(fp2,"@%s/2\n%s\n+\n%s\n",name, seqarg,qualarg);
                }
                else
                {
                fprintf(fp1,"@%s/1\n%s\n+\n%s\n",name, seqarg,qualarg);
                fprintf(fp2,"@%s/2\n%s\n+\n%s\n",ptr->name, ptr->seq,ptr->qual);
                }

                z1 = ptr->name;
                z2 = ptr->seq;
                z3 = ptr->qual;
                TREE_REMOVE(&tree, _Node, linkage, ptr);
                if (z1) free(z1);
                if (z2) free(z2);
                if (z3) free(z3);
                free(ptr);

                live--;
                return;
                }

                TREE_INSERT(&tree, _Node, linkage, Node_new(name));
                live++;
                return;
                }


                static int fetch_func(const bam1_t *b, void *data)
                {
                uint32_t *cigar = bam1_cigar(b);
                const bam1_core_t *c = &b->core;
                int i, l;
                // NO!!!!!!!!! if (b->core.tid < 0) return 0;
                for (i = l = 0; i < c->n_cigar; ++i) {
                int op = cigar[i]&0xf;
                if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                l += cigar[i]>>4;
                }

                uint8_t *s = bam1_seq(b);
                uint8_t *t = bam1_qual(b);
                qual[0] = (char)0;
                seq[0] = (char)0;
                for (i = 0; i < c->l_qseq; i++) seq[i] = bam_nt16_rev_table[bam1_seqi(s, i)];
                seq[i] = (char)0;
                i = 0;
                for (i = 0; i < c->l_qseq; ++i) qual[i] = t[i] + 33;
                qual[i] = (char)0;
                if (c->flag&BAM_FREVERSE)
                {
                char tmps[512];

                reverse_and_compliment(seq, tmps, strlen(seq));
                strcpy(seq,tmps);
                reverse(qual, tmps);
                strcpy(qual,tmps);
                }

                if (c->flag&BAM_FREAD1) add(bam1_qname(b),0,seq,qual);
                else add(bam1_qname(b),1,seq,qual);
                /*
                printf("%s\t%d\t%d\t%s\t%d\t%c\n", fp->header->target_name[c->tid],
                c->pos, c->pos + l, bam1_qname(b), c->qual, (c->flag&BAM_FREVERSE)? '-' : '+');
                */
                return 0;
                }

                int main(int argc, char *argv[])
                {
                samfile_t *fp;

                if (argc == 1) {
                fprintf(stderr, "Usage: bampe2fq <in.bam> [region]\n");
                return 1;
                }
                if ((fp = samopen(argv[1], "rb", 0)) == 0) {
                fprintf(stderr, "bampe2fq: Fail to open BAM file %s\n", argv[1]);
                return 1;
                }

                if (argc == 2) { /* if a region is not specified */
                fp1 = fopen("1.fq","w");
                fp2 = fopen("2.fq","w");
                bam1_t *b = bam_init1();
                while (samread(fp, b) >= 0) fetch_func(b, fp);
                bam_destroy1(b);
                fclose(fp1);
                fclose(fp2);
                // dump_orphans(); not working yet
                info("done");
                printf("DONE\n");
                } else {
                int ref, beg, end;
                bam_index_t *idx;
                if ((idx = bam_index_load(argv[1])) == 0) {
                fprintf(stderr, "bampe2fq: BAM indexing file is not available.\n");
                return 1;
                }
                bam_parse_region(fp->header, argv[2], &ref, &beg, &end);
                if (ref < 0) {
                fprintf(stderr, "bampe2fq: Invalid region %s\n", argv[2]);
                return 1;
                }
                bam_fetch(fp->x.bam, idx, ref, beg, end, fp, fetch_func);
                bam_index_destroy(idx);
                }
                samclose(fp);
                return 0;
                }





                Comment


                • #23
                  [ SEE my next post for full version , edit 6/21/2013 ]
                  You can pass arguments via the argc/argv mechanism. Duckduck/google/bing for more info.

                  You'll need to modify these 3 lines to ~~*something like*~~ this ...

                  if (argc == 4) { /* if a region is not specified -- comment is now way off */
                  fp1 = fopen(argv[2],"w");
                  fp2 = fopen(argv[3],"w");


                  Usage would be
                  ./bampe2fqs in.bam outputfirstpair.fq outputsecondpair.fq

                  In this situation, the args are 4 in number (argc==4), note that C is "zero based" arrays.

                  argv[0] = program name
                  argv[1] = bamfile
                  argv[2] = outfile1
                  argv[3] = outfile2


                  argc is "argument count", this is an integer, usually a small number.
                  argv is "argument vector", this is an array of pointers to pointers of type char; it's a little tricky for newbies. Just think of it as an array of strings.
                  Last edited by Richard Finney; 06-21-2013, 10:48 AM.

                  Comment


                  • #24
                    Thanks very much, I am able to get it to work after modification.

                    Comment


                    • #25
                      Biobambam (https://github.com/gt1/biobambam) also includes a bamtofastq program. I don't know about speed, but I believe it keeps a hash of recent sequences to allow output of paired end data and uses temporary files for when sequences aren't neighbouring.

                      This should avoid memory issues when running on extremely large sets of ones with a large degree of rearrangements and/or chimeras.

                      Comment


                      • #26
                        Improved version ... now called bampe2fqworphans

                        supports regions, must provide output file names as arguments.
                        Code:
                        /*
                        requires :
                            http://piumarta.com/software/tree/
                        
                        how to make:
                            Download an install Ian Piumarta's tree library from http://piumarta.com/software/tree/  (it has been known to go off line for a few days).
                            Edit line below to point to samtools and tree-1.0 for headers and libraries
                        
                            gcc -Wall -O2 -I/h1/finneyr/samtools-0.1.18/ -I./tree-1.0 bampe2fqworphans.c -o bampe2fqworphans -lm /h1/finneyr/samtools-0.1.18/libbam.a /h1/finneyr//zlib-1.2.5/libz.a
                        
                        usage:
                             bampe2fqworphans in.bam out_pair1.fq out_pair2.fq out_orphans.fq [region]
                        
                        examples
                            ./bampe2fqworphans input.bam 1.fq 2.fq orphans.fq
                            ./bampe2fqworphans cancercure.bam 1.fq 2.fq orphans.fq chr7:10000-20000
                        
                        modifed: 9/24/2013, bad compli() call in reverse remove 
                        
                        */
                        
                        #include <stdio.h>
                        #include <stdlib.h>
                        off_t ftello(FILE *stream); // shuts up messages in old compilers
                        
                        #include "sam.h"
                        #include <sys/types.h>
                        #include <unistd.h>
                        #include <tree.h>  // piumarta's AVL tree routines
                        
                        static char compli(char ch) // compliment sequence
                        {
                            if (ch == 'A') return 'T';
                            else if (ch == 'T') return 'A';
                            else if (ch == 'C') return 'G';
                            else if (ch == 'G') return 'C';
                            else if (ch == 'N') return 'N';
                            else if (ch == 'a') return 't';
                            else if (ch == 't') return 'a';
                            else if (ch == 'c') return 'g';
                            else if (ch == 'g') return 'c';
                            return '.';
                        }
                        
                        static void reverse_and_compliment(char s[], char outs[])
                        {
                            register int i,j;
                        
                            i = strlen(s);
                            i--;
                            j = 0;
                            while (i >= 0)
                            {
                                outs[j] = compli(s[i]);
                                j++;
                                i--;
                            }
                            outs[j] = (char)0;
                            return;
                        }
                        
                        static void reverse(char s[], char outs[])
                        {
                            register int i,j;
                        
                            i = strlen(s);
                            i--;
                            j = 0;
                            while (i >= 0)
                            {
                                outs[j] =  s[i] ;
                                j++;
                                i--;
                            }
                            outs[j] = (char)0;
                            return;
                        }
                        
                        static int pid;
                        static FILE *fp1;  // first mate pair fastq output file
                        static FILE *fp2;  // second mate pair fastq output file
                        
                        struct data_type
                        {
                            char *name;
                            int flag;
                            char *seq;
                            char *qual;
                        };
                        
                        static long int incnt = 0;
                        static long int outcnt = 0;
                        static long int live = 0;
                        
                        static void info(char *name)
                        {
                        // you can turn on next line with "if 1" to turn OFF output , if you can't stand the progress "entertainment" output
                        #if 0
                        return;
                        #endif
                        
                        fprintf(stderr,"output %ld in=%ld live=%ld fq pairs , %s \n",outcnt,incnt,live,name);
                        fflush(stderr);
                        
                        #if 0 // turn this on to view memory statistics - see the "pmap" routine available in linux/unix
                        // how much memory do we use ?
                        char m[512];
                        sprintf(m,"pmap %d | grep total",pid);
                        system(m);
                        fflush(stdout);
                        #endif
                        }
                        
                        
                        typedef struct _Node
                        {
                            char *name;
                            char *seq;
                            char *qual;
                            int flag;
                            TREE_ENTRY(_Node)      linkage;
                        } Node;
                        
                        typedef TREE_HEAD(_Tree, _Node) Tree;
                        
                        TREE_DEFINE(_Node, linkage);
                        
                        int fence = 0;
                        
                        char global_seq[500];
                        char global_qual[500];
                        int flag;
                        
                        #if 0
                        // debug - testing for if strings are not null terminated
                        void *mystrdup(void *src_ptr, int len)
                        {
                            char *dest;
                            int SIZE;
                        
                            SIZE = len + 2;
                            dest = (void *)malloc(SIZE);
                            if (!dest) {fprintf(stderr,"ERROR - no more memory\n");  fflush(stderr);  exit(0); } ;
                            memset(dest,0,SIZE);
                            strcpy(dest,src_ptr);
                            return (void *)dest;
                        }
                        #endif
                        
                        Node *Node_new(char *name, int len)
                        {
                               Node *self = (Node *)calloc(1, sizeof(Node));
                        #if 0
                               self->name = (void *)mystrdup(name,strlen(name));
                               self->seq = (void *)mystrdup(global_seq,len);
                               self->qual = (void *)mystrdup(global_qual,len);
                        #else
                               self->name = (void *)strdup(name);
                               self->seq = (void *)strdup(global_seq);
                               self->qual = (void *)strdup(global_qual);
                        #endif
                               self->flag = flag;
                               return self;
                        }
                        
                        int Node_compare(Node *lhs, Node *rhs)
                        {
                            return strcmp(lhs->name, rhs->name);
                        }
                        
                        #if 0
                        // debug for dumping the tree
                        void Node_printer(Node *self, void *stream)
                        {
                            if (fence)
                            {
                                fprintf((FILE *)stream, "%s", self->name);
                                --fence;
                            }
                        }
                        #endif
                        
                        int firsttime = 1;
                        static Tree tree;
                        
                        int orphcnt = 0; // count of orphans
                        
                        void orphan_node_printer(Node *self, void *stream)
                        {
                            fprintf(stream,"@%s\n%s\n+\n%s\n",self->name, self->seq,self->qual);
                            orphcnt++;
                        }
                        
                        void dump_orphans(char *orphan_file_name)
                        {
                            FILE *fpo;  // orphan fastqs
                            fpo = fopen(orphan_file_name,"w");
                            TREE_FORWARD_APPLY(&tree, _Node, linkage, orphan_node_printer, fpo);
                            fclose(fpo);
                            fprintf(stderr,"Note: %d orphans flushed\n",orphcnt);
                            fflush(stderr);
                        }
                        
                        #if 0
                        // debug routine for checking if sequence got messed up
                        void seqsanity(char *s, int len, char *msg1, char *msg2)
                        {
                            int i;
                        
                                for (i = 0; i < len; i++)
                                {
                                   if ((s[i] == 'C')
                                    || (s[i] == 'A')
                                    || (s[i] == 'G')
                                    || (s[i] == 'T')
                                    || (s[i] == 'N')
                                      )
                                   {
                                   }
                                   else
                                   {
                                   fflush(stdout);
                                   fprintf(stderr,"ERROR: at %d [%s][%s][%s]\n",i,s,msg1,msg2);
                                   fprintf(stderr,"ERROR: found char as decimal [%d] \n", s[i]);
                                   fflush(stderr);
                                   exit(0);
                                   }
                                }
                        }
                        #endif
                        
                        void static add(char *name,int flag_arg, char *seqarg, char *qualarg, int len)
                        {
                            char *z1,*z2,*z3;
                            flag = flag_arg;
                        
                            if (firsttime == 1)
                            {
                                static Tree treep = TREE_INITIALIZER(Node_compare);
                                tree = treep;
                                firsttime = 0;
                                pid = getpid();
                            }
                            incnt++;
                        
                        if ((incnt % 10000000) == 0) info(name);
                            Node test = { name };
                            Node *ptr = TREE_FIND(&tree, _Node, linkage, &test);
                            if (ptr)
                            {
                                outcnt++;
                                if (flag_arg == 1)
                                {
                                    fprintf(fp1,"@%s/1\n%s\n+\n%s\n",ptr->name, ptr->seq,ptr->qual);
                                    fprintf(fp2,"@%s/2\n%s\n+\n%s\n",name, seqarg,qualarg);
                                }
                                else
                                {
                                    fprintf(fp1,"@%s/1\n%s\n+\n%s\n",name, seqarg,qualarg);
                                    fprintf(fp2,"@%s/2\n%s\n+\n%s\n",ptr->name, ptr->seq,ptr->qual);
                                }
                        
                                z1 = ptr->name;
                                z2 = ptr->seq;
                                z3 = ptr->qual;
                                TREE_REMOVE(&tree, _Node, linkage, ptr);
                                if (z1) free(z1);
                                if (z2) free(z2);
                                if (z3) free(z3);
                                free(ptr);
                        
                                live--;
                                return;
                            }
                        
                            TREE_INSERT(&tree, _Node, linkage, Node_new(name,len));
                            live++;
                            return;
                        }
                        
                        
                        static int fetch_func(const bam1_t *b, void *data)
                        {
                            uint32_t *cigar = bam1_cigar(b);
                            const bam1_core_t *c = &b->core;
                            int i, l;
                        // NO!!!!!!!!!if (b->core.tid < 0) return 0;
                            for (i = l = 0; i < c->n_cigar; ++i) {
                                int op = cigar[i]&0xf;
                                if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                                    l += cigar[i]>>4;
                            }
                        
                        // l_qseq = length of the query sequence
                        // l_qname = length of the query name
                        
                                uint8_t *s = bam1_seq(b);
                                uint8_t *t = bam1_qual(b);
                                global_qual[0] = (char)0;
                                global_seq[0] = (char)0;
                                for (i = 0; i < c->l_qseq; i++) global_seq[i] = bam_nt16_rev_table[bam1_seqi(s, i)];
                                global_seq[i] = (char)0;
                                i = 0;
                            for (i = 0; i < c->l_qseq; ++i) global_qual[i] = t[i] + 33;
                                global_qual[i] = (char)0;
                        if (c->flag&BAM_FREVERSE)
                        {
                            char tmps[512];
                        
                            reverse_and_compliment(global_seq, tmps);
                            strcpy(global_seq,tmps);
                            reverse(global_qual, tmps);
                            strcpy(global_qual,tmps);
                        }
                        
                                if (c->flag&BAM_FREAD1) add(bam1_qname(b),0,global_seq,global_qual,c->l_qseq);
                                else                    add(bam1_qname(b),1,global_seq,global_qual,c->l_qseq);
                        /*
                            printf("%s\t%d\t%d\t%s\t%d\t%c\n", fp->header->target_name[c->tid],
                                c->pos, c->pos + l, bam1_qname(b), c->qual, (c->flag&BAM_FREVERSE)? '-' : '+');
                        */
                            return 0;
                        }
                        
                        int main(int argc, char *argv[])
                        {
                            samfile_t *fp;
                        
                            if ( (argc < 5) || (argc > 6) )
                            {
                                fprintf(stderr, "Usage: bampe2fqworphans in.bam pair1.fq pair2.fq orphans.fq [region]\n");
                                fprintf(stderr, "Example without region: ./bampe2fqworphans cure2cancer.bam pair1.fq pair2.fq orphans.fq\n");
                                fprintf(stderr, "Example with region: ./bampe2fqworphans cure2cancer.bam pair1.fq pair2.fq orphans.fq chr19:chr7:55054219-55242525\n");
                                fprintf(stderr, "Example: ./bampe2fqworphans /tcga_next_gen/NG_buckets/bucket11/bam/TCGA-AA-3858-01A-01W-0900-09_IlluminaGA-DNASeq_exome.bam  1.fq 2.fq orphans.fq  chr19:chr7:55054219-55242525\n");
                                return 1;
                            }
                            if ((fp = samopen(argv[1], "rb", 0)) == 0)
                            {
                                fprintf(stderr, "bampe2fqworphans: Fail to open BAM file %s\n", argv[1]);
                                return 1;
                            }
                        
                            fp1 = fopen(argv[2],"w");
                            fp2 = fopen(argv[3],"w");
                            if (argc == 5)
                            {
                                bam1_t *b = bam_init1();
                                while (samread(fp, b) >= 0) fetch_func(b, fp);
                                bam_destroy1(b);
                                fclose(fp1);
                                fclose(fp2);
                                dump_orphans(argv[4]);  // test
                                info("done");
                                printf("DONE\n");
                            } else {
                                int ref, beg, end;
                                bam_index_t *idx;
                                if ((idx = bam_index_load(argv[1])) == 0) {
                                    fprintf(stderr, "bampe2fqworphans: BAM indexing file is not available.\n");
                                    return 1;
                                }
                        
                                bam_parse_region(fp->header, argv[5], &ref, &beg, &end);
                                if (ref < 0) {
                                    fprintf(stderr, "bampe2fqworphans: Invalid region %s\n", argv[2]);
                                    return 1;
                                }
                                bam_fetch(fp->x.bam, idx, ref, beg, end, fp, fetch_func);
                                fclose(fp1);
                                fclose(fp2);
                                dump_orphans(argv[4]);  // test
                                info("done");
                                printf("DONE\n");
                                bam_index_destroy(idx);
                            }
                            samclose(fp);
                            return 0;
                        }
                        I should really open a github account.
                        Last edited by Richard Finney; 09-24-2013, 10:05 AM.

                        Comment


                        • #27
                          Originally posted by Richard Finney View Post
                          Improved version ... now called bampe2fqworphans

                          supports regions, must provide output file names as arguments.
                          Code:
                          /*
                          requires :
                              http://piumarta.com/software/tree/
                          
                          how to make:
                              Download an install Ian Piumarta's tree library from http://piumarta.com/software/tree/  (it has been known to go off line for a few days).
                              Edit line below to point to samtools and tree-1.0 for headers and libraries
                          
                              gcc -Wall -O2 -I/h1/finneyr/samtools-0.1.18/ -I./tree-1.0 bampe2fqworphans.c -o bampe2fqworphans -lm /h1/finneyr/samtools-0.1.18/libbam.a /h1/finneyr//zlib-1.2.5/libz.a
                          
                          usage:
                               bampe2fqworphans in.bam out_pair1.fq out_pair2.fq out_orphans.fq [region]
                          
                          examples
                              ./bampe2fqworphans input.bam 1.fq 2.fq orphans.fq
                              ./bampe2fqworphans cancercure.bam 1.fq 2.fq orphans.fq chr7:10000-20000
                          */
                          
                          #include <stdio.h>
                          #include <stdlib.h>
                          off_t ftello(FILE *stream); // shuts up messages in old compilers
                          
                          #include "sam.h"
                          #include <sys/types.h>
                          #include <unistd.h>
                          #include <tree.h>  // piumarta's AVL tree routines
                          
                          static char compli(char ch) // compliment sequence
                          {
                              if (ch == 'A') return 'T';
                              else if (ch == 'T') return 'A';
                              else if (ch == 'C') return 'G';
                              else if (ch == 'G') return 'C';
                              else if (ch == 'N') return 'N';
                              else if (ch == 'a') return 't';
                              else if (ch == 't') return 'a';
                              else if (ch == 'c') return 'g';
                              else if (ch == 'g') return 'c';
                              return '.';
                          }
                          
                          static void reverse_and_compliment(char s[], char outs[])
                          {
                              register int i,j;
                          
                              i = strlen(s);
                              i--;
                              j = 0;
                              while (i >= 0)
                              {
                                  outs[j] = compli(s[i]);
                                  j++;
                                  i--;
                              }
                              outs[j] = (char)0;
                              return;
                          }
                          
                          static void reverse(char s[], char outs[])
                          {
                              register int i,j;
                          
                              i = strlen(s);
                              i--;
                              j = 0;
                              while (i >= 0)
                              {
                                  outs[j] = compli(s[i]);
                                  j++;
                                  i--;
                              }
                              outs[j] = (char)0;
                              return;
                          }
                          
                          static int pid;
                          static FILE *fp1;  // first mate pair fastq output file
                          static FILE *fp2;  // second mate pair fastq output file
                          
                          struct data_type
                          {
                              char *name;
                              int flag;
                              char *seq;
                              char *qual;
                          };
                          
                          static long int incnt = 0;
                          static long int outcnt = 0;
                          static long int live = 0;
                          
                          static void info(char *name)
                          {
                          // you can turn on next line with "if 1" to turn OFF output , if you can't stand the progress "entertainment" output
                          #if 0
                          return;
                          #endif
                          
                          fprintf(stderr,"output %ld in=%ld live=%ld fq pairs , %s \n",outcnt,incnt,live,name);
                          fflush(stderr);
                          
                          #if 0 // turn this on to view memory statistics - see the "pmap" routine available in linux/unix
                          // how much memory do we use ?
                          char m[512];
                          sprintf(m,"pmap %d | grep total",pid);
                          system(m);
                          fflush(stdout);
                          #endif
                          }
                          
                          
                          typedef struct _Node
                          {
                              char *name;
                              char *seq;
                              char *qual;
                              int flag;
                              TREE_ENTRY(_Node)      linkage;
                          } Node;
                          
                          typedef TREE_HEAD(_Tree, _Node) Tree;
                          
                          TREE_DEFINE(_Node, linkage);
                          
                          int fence = 0;
                          
                          char global_seq[500];
                          char global_qual[500];
                          int flag;
                          
                          #if 0
                          // debug - testing for if strings are not null terminated
                          void *mystrdup(void *src_ptr, int len)
                          {
                              char *dest;
                              int SIZE;
                          
                              SIZE = len + 2;
                              dest = (void *)malloc(SIZE);
                              if (!dest) {fprintf(stderr,"ERROR - no more memory\n");  fflush(stderr);  exit(0); } ;
                              memset(dest,0,SIZE);
                              strcpy(dest,src_ptr);
                              return (void *)dest;
                          }
                          #endif
                          
                          Node *Node_new(char *name, int len)
                          {
                                 Node *self = (Node *)calloc(1, sizeof(Node));
                          #if 0
                                 self->name = (void *)mystrdup(name,strlen(name));
                                 self->seq = (void *)mystrdup(global_seq,len);
                                 self->qual = (void *)mystrdup(global_qual,len);
                          #else
                                 self->name = (void *)strdup(name);
                                 self->seq = (void *)strdup(global_seq);
                                 self->qual = (void *)strdup(global_qual);
                          #endif
                                 self->flag = flag;
                                 return self;
                          }
                          
                          int Node_compare(Node *lhs, Node *rhs)
                          {
                              return strcmp(lhs->name, rhs->name);
                          }
                          
                          #if 0
                          // debug for dumping the tree
                          void Node_printer(Node *self, void *stream)
                          {
                              if (fence)
                              {
                                  fprintf((FILE *)stream, "%s", self->name);
                                  --fence;
                              }
                          }
                          #endif
                          
                          int firsttime = 1;
                          static Tree tree;
                          
                          int orphcnt = 0; // count of orphans
                          
                          void orphan_node_printer(Node *self, void *stream)
                          {
                              fprintf(stream,"@%s\n%s\n+\n%s\n",self->name, self->seq,self->qual);
                              orphcnt++;
                          }
                          
                          void dump_orphans(char *orphan_file_name)
                          {
                              FILE *fpo;  // orphan fastqs
                              fpo = fopen(orphan_file_name,"w");
                              TREE_FORWARD_APPLY(&tree, _Node, linkage, orphan_node_printer, fpo);
                              fclose(fpo);
                              fprintf(stderr,"Note: %d orphans flushed\n",orphcnt);
                              fflush(stderr);
                          }
                          
                          #if 0
                          // debug routine for checking if sequence got messed up
                          void seqsanity(char *s, int len, char *msg1, char *msg2)
                          {
                              int i;
                          
                                  for (i = 0; i < len; i++)
                                  {
                                     if ((s[i] == 'C')
                                      || (s[i] == 'A')
                                      || (s[i] == 'G')
                                      || (s[i] == 'T')
                                      || (s[i] == 'N')
                                        )
                                     {
                                     }
                                     else
                                     {
                                     fflush(stdout);
                                     fprintf(stderr,"ERROR: at %d [%s][%s][%s]\n",i,s,msg1,msg2);
                                     fprintf(stderr,"ERROR: found char as decimal [%d] \n", s[i]);
                                     fflush(stderr);
                                     exit(0);
                                     }
                                  }
                          }
                          #endif
                          
                          void static add(char *name,int flag_arg, char *seqarg, char *qualarg, int len)
                          {
                              char *z1,*z2,*z3;
                              flag = flag_arg;
                          
                              if (firsttime == 1)
                              {
                                  static Tree treep = TREE_INITIALIZER(Node_compare);
                                  tree = treep;
                                  firsttime = 0;
                                  pid = getpid();
                              }
                              incnt++;
                          
                          if ((incnt % 10000000) == 0) info(name);
                              Node test = { name };
                              Node *ptr = TREE_FIND(&tree, _Node, linkage, &test);
                              if (ptr)
                              {
                                  outcnt++;
                                  if (flag_arg == 1)
                                  {
                                      fprintf(fp1,"@%s/1\n%s\n+\n%s\n",ptr->name, ptr->seq,ptr->qual);
                                      fprintf(fp2,"@%s/2\n%s\n+\n%s\n",name, seqarg,qualarg);
                                  }
                                  else
                                  {
                                      fprintf(fp1,"@%s/1\n%s\n+\n%s\n",name, seqarg,qualarg);
                                      fprintf(fp2,"@%s/2\n%s\n+\n%s\n",ptr->name, ptr->seq,ptr->qual);
                                  }
                          
                                  z1 = ptr->name;
                                  z2 = ptr->seq;
                                  z3 = ptr->qual;
                                  TREE_REMOVE(&tree, _Node, linkage, ptr);
                                  if (z1) free(z1);
                                  if (z2) free(z2);
                                  if (z3) free(z3);
                                  free(ptr);
                          
                                  live--;
                                  return;
                              }
                          
                              TREE_INSERT(&tree, _Node, linkage, Node_new(name,len));
                              live++;
                              return;
                          }
                          
                          
                          static int fetch_func(const bam1_t *b, void *data)
                          {
                              uint32_t *cigar = bam1_cigar(b);
                              const bam1_core_t *c = &b->core;
                              int i, l;
                          // NO!!!!!!!!!if (b->core.tid < 0) return 0;
                              for (i = l = 0; i < c->n_cigar; ++i) {
                                  int op = cigar[i]&0xf;
                                  if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                                      l += cigar[i]>>4;
                              }
                          
                          // l_qseq = length of the query sequence
                          // l_qname = length of the query name
                          
                                  uint8_t *s = bam1_seq(b);
                                  uint8_t *t = bam1_qual(b);
                                  global_qual[0] = (char)0;
                                  global_seq[0] = (char)0;
                                  for (i = 0; i < c->l_qseq; i++) global_seq[i] = bam_nt16_rev_table[bam1_seqi(s, i)];
                                  global_seq[i] = (char)0;
                                  i = 0;
                              for (i = 0; i < c->l_qseq; ++i) global_qual[i] = t[i] + 33;
                                  global_qual[i] = (char)0;
                          if (c->flag&BAM_FREVERSE)
                          {
                              char tmps[512];
                          
                              reverse_and_compliment(global_seq, tmps);
                              strcpy(global_seq,tmps);
                              reverse(global_qual, tmps);
                              strcpy(global_qual,tmps);
                          }
                          
                                  if (c->flag&BAM_FREAD1) add(bam1_qname(b),0,global_seq,global_qual,c->l_qseq);
                                  else                    add(bam1_qname(b),1,global_seq,global_qual,c->l_qseq);
                          /*
                              printf("%s\t%d\t%d\t%s\t%d\t%c\n", fp->header->target_name[c->tid],
                                  c->pos, c->pos + l, bam1_qname(b), c->qual, (c->flag&BAM_FREVERSE)? '-' : '+');
                          */
                              return 0;
                          }
                          
                          int main(int argc, char *argv[])
                          {
                              samfile_t *fp;
                          
                              if ( (argc < 5) || (argc > 6) )
                              {
                                  fprintf(stderr, "Usage: bampe2fqworphans in.bam pair1.fq pair2.fq orphans.fq [region]\n");
                                  fprintf(stderr, "Example without region: ./bampe2fqworphans cure2cancer.bam pair1.fq pair2.fq orphans.fq\n");
                                  fprintf(stderr, "Example with region: ./bampe2fqworphans cure2cancer.bam pair1.fq pair2.fq orphans.fq chr19:chr7:55054219-55242525\n");
                                  fprintf(stderr, "Example: ./bampe2fqworphans /tcga_next_gen/NG_buckets/bucket11/bam/TCGA-AA-3858-01A-01W-0900-09_IlluminaGA-DNASeq_exome.bam  1.fq 2.fq orphans.fq  chr19:chr7:55054219-55242525\n");
                                  return 1;
                              }
                              if ((fp = samopen(argv[1], "rb", 0)) == 0)
                              {
                                  fprintf(stderr, "bampe2fqworphans: Fail to open BAM file %s\n", argv[1]);
                                  return 1;
                              }
                          
                              fp1 = fopen(argv[2],"w");
                              fp2 = fopen(argv[3],"w");
                              if (argc == 5)
                              {
                                  bam1_t *b = bam_init1();
                                  while (samread(fp, b) >= 0) fetch_func(b, fp);
                                  bam_destroy1(b);
                                  fclose(fp1);
                                  fclose(fp2);
                                  dump_orphans(argv[4]);  // test
                                  info("done");
                                  printf("DONE\n");
                              } else {
                                  int ref, beg, end;
                                  bam_index_t *idx;
                                  if ((idx = bam_index_load(argv[1])) == 0) {
                                      fprintf(stderr, "bampe2fqworphans: BAM indexing file is not available.\n");
                                      return 1;
                                  }
                          
                                  bam_parse_region(fp->header, argv[5], &ref, &beg, &end);
                                  if (ref < 0) {
                                      fprintf(stderr, "bampe2fqworphans: Invalid region %s\n", argv[2]);
                                      return 1;
                                  }
                                  bam_fetch(fp->x.bam, idx, ref, beg, end, fp, fetch_func);
                                  fclose(fp1);
                                  fclose(fp2);
                                  dump_orphans(argv[4]);  // test
                                  info("done");
                                  printf("DONE\n");
                                  bam_index_destroy(idx);
                              }
                              samclose(fp);
                              return 0;
                          }
                          I should really open a github account.
                          Thanks Richard for this. I just got it installed and giving it a go and it is indeed much faster than other tools I've tried (e.g. PICARD, Bedtools, Hydra-SV). Also, the small memory footprint is very impressive compared to other tools.

                          Comment


                          • #28
                            Originally posted by Richard Finney View Post
                            Improved version ... now called bampe2fqworphans

                            supports regions, must provide output file names as arguments.
                            ...
                            Hi Richard,

                            Does the bampe2fqworphans require that the input bam be named sorted or be processed in any sort of way to work properly? I just finished extract fastq files from a fairly large bam file (78G) and the extracted, gzipped paired end fastq files are 26G and 27G respectively.

                            I am investigating the actual number of paired-end reads in the files, but the fact that the size of the gzipped fastq files are so much smaller makes me think that I should have been name sorting my bam file first? Is that correct?

                            Comment


                            • #29
                              Originally posted by fongchun View Post
                              Hi Richard,

                              Does the bampe2fqworphans require that the input bam be named sorted or be processed in any sort of way to work properly? I just finished extract fastq files from a fairly large bam file (78G) and the extracted, gzipped paired end fastq files are 26G and 27G respectively.

                              I am investigating the actual number of paired-end reads in the files, but the fact that the size of the gzipped fastq files are so much smaller makes me think that I should have been name sorting my bam file first? Is that correct?
                              Looking at the source code of bampe2fqworphans sorting by name should not be necessary. It is quite possible for the extracted compressed fastq files to be smaller than the BAM file, as the BAM file contains more information (mapping positions, flags, tags, etc).

                              If you are looking into converting large amounts of BAM files to FastQ you may also want to consider biobambam as mentioned by James above. According to my tests it is faster than bampe2fqworphans (by a factor of more than 2 on for instance ftp://ftp.1000genomes.ebi.ac.uk/vol1...e.20120522.bam). It also does not run into trouble for high depth files as Picard or bampe2fqworphans do as it does not keep an arbitrarily larger table/tree in internal memory.

                              Comment


                              • #30
                                Thanks for that information.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                21 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X