Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Quickly Reading Through a Bam file using C++ programme

    Hi, I am currently trying to write some programme that will transverse the whole bam file in a short period of time.

    The standard samtools api works extreme fast if I use bam_fetch (I can read through the whole transcriptom regions (gtf file from UCSC) within half an hour using a 7gb input)

    The problem is, I want to detect reads in regions not specified by the transcriptom region (not in the user input). If I use the standard function in reading the reads one after another, it will take me around 2 hours. Is there any suggestion for me to speed the process up?

    Thank you!

  • #2
    So why don't you try the bam_fetch API but using the inter-gene regions?

    Comment


    • #3
      There are a few problems I noted with the bam_fetch, the first one most notably is that the reads returned from the bam_fetch functions sometimes behave rather funny. There was a time when I got read that has CIGAR strings that is not found in the sam file (I samtools view the file and grep the read name, the CIGAR was different, there is an additional alignment that pop up)
      Secondly, I would supposed that if a read is spamming across the inter-gene regions and the annotated region, it will got counted twice? In order to record and prevent such situation, an additional map structure might be required which may ultimately use up a large amount of memory.
      Thirdly, I would suppose that by using the bam_fetch to the whole inter-gene regions, ultimately, what they are doing is still transverses all the reads one by one using the same mechanism as the slow method? I am not sure, maybe I should try and write a pseudo programme to time it. Will come back afterwards.

      Comment


      • #4
        Originally posted by choishingwan View Post
        Secondly, I would supposed that if a read is spamming across the inter-gene regions and the annotated region, it will got counted twice?
        Yes, reads which straddle a gene boundary would be found when using bam_fetch on the gene regions, and also when using bam_fetch on the inter-gene regions. How you deal with this depends on why you want these reads.
        Originally posted by choishingwan View Post
        Thirdly, I would suppose that by using the bam_fetch to the whole inter-gene regions, ultimately, what they are doing is still transverses all the reads one by one using the same mechanism as the slow method?
        No. The point of the BAM index is to greatly reduce the part of the file which must be scanned, focussing only on those reads mapped to the region of interest. This works really well if you are only interested in a small region compared to the whole genome.

        Comment


        • #5
          I see, I will try and see if by using the bam_fetch instead of the read by read reading and see if the speed improves. Will come back to you once I have got the time done

          Comment


          • #6
            So I have just finished the test using the following code:

            #include <string>
            #include <iostream>
            #include <cstdlib>
            #include <bam.h>
            #include <sstream>
            #define WHERE fprintf(stderr, "%d\n", __LINE__)

            using namespace std;

            int countReads(const bam1_t *bamRead, void *data){
            return 0;
            }

            int main(int argc, char* argv[]){
            if(argc != 3){
            cerr << "Usage: " << argv[0] << " <bam file> <mode> ";
            exit(0);
            }

            bam1_t* b=bam_init1();
            bamFile in = bam_open(argv[1], "r");
            bam_header_t *header;
            bam_index_t *idx;
            idx = bam_index_load(argv[1]);
            header = bam_header_read(in);
            if(in==NULL) return -1;
            if(b==NULL) return -1;
            if(idx==NULL) return -1;
            if(header == NULL) return -1;
            cerr << "Alright" << endl;
            if(atoi(argv[2]) == 1){
            //bam_fetch
            for(int i = 0; i < header->n_targets-1;++i){
            string chr = header->target_name[i];
            int length = header->target_len[i];
            string location = chr + ":1-";
            stringstream ss;
            ss << length-1;
            location.append(ss.str());
            ss.str() = "";
            int begin, end, ref_id;
            bam_parse_region(header, location.c_str(), &ref_id, &begin, &end);
            int* temp = new int();
            bam_fetch(in, idx, ref_id, begin, end, temp, countReads);
            }
            }
            else{
            //Going through the reads one by one
            while(bam_read1(in,b) >=0){
            }
            }
            bam_header_destroy(header);
            bam_close(in);
            bam_destroy1(b);
            bam_index_destroy(idx);
            return 0;

            }
            With a 7.8GB input, if I use bam_fetch, on average I got the following time:
            real 1m42.9934s
            user 1m40.7904s
            sys 0m1.9088s

            And if I use bam_read1, I on average got the following time:
            real 1m54.7636s
            user 1m52.5102s
            sys 0m1.9284s

            As my code doesn't contain any of the modification, I think the time will scale up quite a lot. However, it does seems like the difference is rather acceptable. Guess the reason why I though bam_read1 is much slower comes from the time where our server is heavy loaded and that I was actually trying to find reads in specific region (which is obviously a job bam_fetch designed for).

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Essential Discoveries and Tools in Epitranscriptomics
              by seqadmin


              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
              Yesterday, 07:01 AM
            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            50 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            50 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            44 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            55 views
            0 likes
            Last Post seqadmin  
            Working...
            X