SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Fastest way to extract differing positions from each alignment in a BAM file (http://seqanswers.com/forums/showthread.php?t=16211)

CHRYSES 12-14-2011 06:47 AM

Fastest way to extract differing positions from each alignment in a BAM file
 
Hi,

What would be the fastest way (I have to do this hundreds of millions times) to extract for each aligned read in a BAM file:
1) The positions where the read bases differ from a reference sequence.
2) The PHRED base quality values of these bases. If the difference is an indel, the quality value will, of course, be skipped.

As far as I know, I cannot use mpileup or anything I know of due to memory limitation as this is a very custom amplicon reference analysis, with >500 million coverage per base position on the reference amplicon.

In short, I need to apply an efficient approach to extract all differing positions for each aligned read.

Thanks.

adaptivegenome 12-14-2011 07:08 AM

samtools and bamtools both provide very fast APIs you can use, however this requires a minimal experience with a programming language or script...

CHRYSES 12-14-2011 07:13 AM

Quote:

Originally Posted by genericforms (Post 59548)
samtools and bamtools both provide very fast APIs you can use, however this requires a minimal experience with a programming language or script...

Yeah, I tried to get into that, but I am not good at "C" language, I could not follow it. I hope someone else has created/thought of something...

I could go directly into text format (i.e. SAM) and parse it with PERL, but that's really very slooooooow.

adaptivegenome 12-14-2011 07:18 AM

If you are going to examine every read in order then I suppose you could parse a giant text file. This sort of sequential analysis is hard to speed up unless you parallelize it. If you are parsing a text file you could break it into many parts and then run your PERL script on the many parts in parallel (if you have access to that kind of equipment).

I am not aware of an off the shelf tool. Sorry!

Personally I opt for pthreads and C/C++...

swbarnes2 12-14-2011 09:11 AM

Illumina reads are error prone. If you pull every single read with a discrepancy from reference, you are going to pull a lot of noise.

I don't think that a pileup can be generated with only variant positions, but you could grep the pileup to only get lines with alterante letters. The pileup will have the position, all the letters called by all the reads that cross the position, and all the qualities for all the reads that cross the position.

CHRYSES 12-14-2011 12:28 PM

Quote:

Originally Posted by swbarnes2 (Post 59570)
Illumina reads are error prone. If you pull every single read with a discrepancy from reference, you are going to pull a lot of noise.

I don't think that a pileup can be generated with only variant positions, but you could grep the pileup to only get lines with alterante letters. The pileup will have the position, all the letters called by all the reads that cross the position, and all the qualities for all the reads that cross the position.

Yes, but how can I run a pileup on a single position with 500 million X coverage ? I think i will need to do this read by read...


All times are GMT -8. The time now is 06:56 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.