Seqanswers Leaderboard Ad

**maubp** · 09-21-2012, 02:35 AM

So why don't you try the bam_fetch API but using the inter-gene regions?

**choishingwan** · 09-21-2012, 06:33 AM

There are a few problems I noted with the bam_fetch, the first one most notably is that the reads returned from the bam_fetch functions sometimes behave rather funny. There was a time when I got read that has CIGAR strings that is not found in the sam file (I samtools view the file and grep the read name, the CIGAR was different, there is an additional alignment that pop up)
Secondly, I would supposed that if a read is spamming across the inter-gene regions and the annotated region, it will got counted twice? In order to record and prevent such situation, an additional map structure might be required which may ultimately use up a large amount of memory.
Thirdly, I would suppose that by using the bam_fetch to the whole inter-gene regions, ultimately, what they are doing is still transverses all the reads one by one using the same mechanism as the slow method? I am not sure, maybe I should try and write a pseudo programme to time it. Will come back afterwards.

**maubp** · 09-21-2012, 06:46 AM

Originally posted by choishingwan View Post

Secondly, I would supposed that if a read is spamming across the inter-gene regions and the annotated region, it will got counted twice?

Yes, reads which straddle a gene boundary would be found when using bam_fetch on the gene regions, and also when using bam_fetch on the inter-gene regions. How you deal with this depends on why you want these reads.

Originally posted by choishingwan View Post

Thirdly, I would suppose that by using the bam_fetch to the whole inter-gene regions, ultimately, what they are doing is still transverses all the reads one by one using the same mechanism as the slow method?

No. The point of the BAM index is to greatly reduce the part of the file which must be scanned, focussing only on those reads mapped to the region of interest. This works really well if you are only interested in a small region compared to the whole genome.

**choishingwan** · 09-21-2012, 07:33 AM

I see, I will try and see if by using the bam_fetch instead of the read by read reading and see if the speed improves. Will come back to you once I have got the time done

**choishingwan** · 09-25-2012, 09:30 PM

So I have just finished the test using the following code:

#include <string>
#include <iostream>
#include <cstdlib>
#include <bam.h>
#include <sstream>
#define WHERE fprintf(stderr, "%d\n", __LINE__)

using namespace std;

int countReads(const bam1_t *bamRead, void *data){
return 0;
}

int main(int argc, char* argv[]){
if(argc != 3){
cerr << "Usage: " << argv[0] << " <bam file> <mode> ";
exit(0);
}

bam1_t* b=bam_init1();
bamFile in = bam_open(argv[1], "r");
bam_header_t *header;
bam_index_t *idx;
idx = bam_index_load(argv[1]);
header = bam_header_read(in);
if(in==NULL) return -1;
if(b==NULL) return -1;
if(idx==NULL) return -1;
if(header == NULL) return -1;
cerr << "Alright" << endl;
if(atoi(argv[2]) == 1){
//bam_fetch
for(int i = 0; i < header->n_targets-1;++i){
string chr = header->target_name[i];
int length = header->target_len[i];
string location = chr + ":1-";
stringstream ss;
ss << length-1;
location.append(ss.str());
ss.str() = "";
int begin, end, ref_id;
bam_parse_region(header, location.c_str(), &ref_id, &begin, &end);
int* temp = new int();
bam_fetch(in, idx, ref_id, begin, end, temp, countReads);
}
}
else{
//Going through the reads one by one
while(bam_read1(in,b) >=0){
}
}
bam_header_destroy(header);
bam_close(in);
bam_destroy1(b);
bam_index_destroy(idx);
return 0;

}

With a 7.8GB input, if I use bam_fetch, on average I got the following time:
real 1m42.9934s
user 1m40.7904s
sys 0m1.9088s

And if I use bam_read1, I on average got the following time:
real 1m54.7636s
user 1m52.5102s
sys 0m1.9284s

As my code doesn't contain any of the modification, I think the time will scale up quite a lot. However, it does seems like the difference is rather acceptable. Guess the reason why I though bam_read1 is much slower comes from the time where our server is heavy loaded and that I was actually trying to find reads in specific region (which is obviously a job bam_fetch designed for).

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 22 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 42 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 28 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 42 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

Quickly Reading Through a Bam file using C++ programme

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News