Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Quickly Reading Through a Bam file using C++ programme

    Hi, I am currently trying to write some programme that will transverse the whole bam file in a short period of time.

    The standard samtools api works extreme fast if I use bam_fetch (I can read through the whole transcriptom regions (gtf file from UCSC) within half an hour using a 7gb input)

    The problem is, I want to detect reads in regions not specified by the transcriptom region (not in the user input). If I use the standard function in reading the reads one after another, it will take me around 2 hours. Is there any suggestion for me to speed the process up?

    Thank you!

  • #2
    So why don't you try the bam_fetch API but using the inter-gene regions?

    Comment


    • #3
      There are a few problems I noted with the bam_fetch, the first one most notably is that the reads returned from the bam_fetch functions sometimes behave rather funny. There was a time when I got read that has CIGAR strings that is not found in the sam file (I samtools view the file and grep the read name, the CIGAR was different, there is an additional alignment that pop up)
      Secondly, I would supposed that if a read is spamming across the inter-gene regions and the annotated region, it will got counted twice? In order to record and prevent such situation, an additional map structure might be required which may ultimately use up a large amount of memory.
      Thirdly, I would suppose that by using the bam_fetch to the whole inter-gene regions, ultimately, what they are doing is still transverses all the reads one by one using the same mechanism as the slow method? I am not sure, maybe I should try and write a pseudo programme to time it. Will come back afterwards.

      Comment


      • #4
        Originally posted by choishingwan View Post
        Secondly, I would supposed that if a read is spamming across the inter-gene regions and the annotated region, it will got counted twice?
        Yes, reads which straddle a gene boundary would be found when using bam_fetch on the gene regions, and also when using bam_fetch on the inter-gene regions. How you deal with this depends on why you want these reads.
        Originally posted by choishingwan View Post
        Thirdly, I would suppose that by using the bam_fetch to the whole inter-gene regions, ultimately, what they are doing is still transverses all the reads one by one using the same mechanism as the slow method?
        No. The point of the BAM index is to greatly reduce the part of the file which must be scanned, focussing only on those reads mapped to the region of interest. This works really well if you are only interested in a small region compared to the whole genome.

        Comment


        • #5
          I see, I will try and see if by using the bam_fetch instead of the read by read reading and see if the speed improves. Will come back to you once I have got the time done

          Comment


          • #6
            So I have just finished the test using the following code:

            #include <string>
            #include <iostream>
            #include <cstdlib>
            #include <bam.h>
            #include <sstream>
            #define WHERE fprintf(stderr, "%d\n", __LINE__)

            using namespace std;

            int countReads(const bam1_t *bamRead, void *data){
            return 0;
            }

            int main(int argc, char* argv[]){
            if(argc != 3){
            cerr << "Usage: " << argv[0] << " <bam file> <mode> ";
            exit(0);
            }

            bam1_t* b=bam_init1();
            bamFile in = bam_open(argv[1], "r");
            bam_header_t *header;
            bam_index_t *idx;
            idx = bam_index_load(argv[1]);
            header = bam_header_read(in);
            if(in==NULL) return -1;
            if(b==NULL) return -1;
            if(idx==NULL) return -1;
            if(header == NULL) return -1;
            cerr << "Alright" << endl;
            if(atoi(argv[2]) == 1){
            //bam_fetch
            for(int i = 0; i < header->n_targets-1;++i){
            string chr = header->target_name[i];
            int length = header->target_len[i];
            string location = chr + ":1-";
            stringstream ss;
            ss << length-1;
            location.append(ss.str());
            ss.str() = "";
            int begin, end, ref_id;
            bam_parse_region(header, location.c_str(), &ref_id, &begin, &end);
            int* temp = new int();
            bam_fetch(in, idx, ref_id, begin, end, temp, countReads);
            }
            }
            else{
            //Going through the reads one by one
            while(bam_read1(in,b) >=0){
            }
            }
            bam_header_destroy(header);
            bam_close(in);
            bam_destroy1(b);
            bam_index_destroy(idx);
            return 0;

            }
            With a 7.8GB input, if I use bam_fetch, on average I got the following time:
            real 1m42.9934s
            user 1m40.7904s
            sys 0m1.9088s

            And if I use bam_read1, I on average got the following time:
            real 1m54.7636s
            user 1m52.5102s
            sys 0m1.9284s

            As my code doesn't contain any of the modification, I think the time will scale up quite a lot. However, it does seems like the difference is rather acceptable. Guess the reason why I though bam_read1 is much slower comes from the time where our server is heavy loaded and that I was actually trying to find reads in specific region (which is obviously a job bam_fetch designed for).

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Advancing Precision Medicine for Rare Diseases in Children
              by seqadmin




              Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
              12-16-2024, 07:57 AM
            • seqadmin
              Recent Advances in Sequencing Technologies
              by seqadmin



              Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

              Long-Read Sequencing
              Long-read sequencing has seen remarkable advancements,...
              12-02-2024, 01:49 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 12-17-2024, 10:28 AM
            0 responses
            22 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 12-13-2024, 08:24 AM
            0 responses
            42 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 12-12-2024, 07:41 AM
            0 responses
            28 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 12-11-2024, 07:45 AM
            0 responses
            42 views
            0 likes
            Last Post seqadmin  
            Working...
            X