
**swbarnes2** · 04-15-2014, 10:40 AM

Doesn't the MD line tell you how the read differs from the genome? Wouldn't it be easier to use that and the sequence from the SAM line, so you don't have to look at another file at all?
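A minimal sketch of that idea in Python, for anyone who wants to try it: rebuild the reference bases covered by an alignment from the read sequence, the CIGAR string and the MD tag alone. The function name `reference_from_md` is made up for illustration, and the sketch assumes a well-formed MD tag; spliced (`N`) gaps are not represented in MD, so they are simply left out of the result.

```python
import re

def reference_from_md(seq, cigar, md):
    """Rebuild the reference bases covered by one alignment (hypothetical helper).

    seq, cigar: SEQ and CIGAR fields of the SAM line; md: value of the MD:Z tag.
    """
    # Step 1: keep only read bases that are aligned to the reference (M, =, X).
    aligned = []
    pos = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=X":
            aligned.append(seq[pos:pos + n])
            pos += n
        elif op in "IS":
            pos += n  # insertions and soft clips consume read bases only
        # D, N, H and P consume no read bases
    aligned = "".join(aligned)

    # Step 2: walk the MD tag, copying matches from the read and taking
    # mismatched or deleted bases from the tag itself.
    ref = []
    i = 0
    for matches, deleted, mismatch in re.findall(
            r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md):
        if matches:
            n = int(matches)
            ref.append(aligned[i:i + n])
            i += n
        elif deleted:
            ref.append(deleted)   # bases present in the reference only
        else:
            ref.append(mismatch)  # reference base at a mismatched position
            i += 1
    return "".join(ref)

print(reference_from_md("TTACGTAGG", "2S4M1I1M2D1M", "2C2^AT1"))
```

In the toy call, the two soft-clipped bases and the inserted base are dropped, the mismatch and the deletion are filled in from the MD tag, and no FASTA file is touched.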

**westerman** · 04-15-2014, 10:57 AM

How fast do you want it to be? In my experience 'samtools faidx' is rather speedy, but you are asking for a lot of data. As a test, using bash, I fetched 100,000 genome coordinates (randomly generated, with lengths from 50 to 100 bases, on all chromosomes) via 'samtools faidx'. It took about 7 minutes of wall-clock time and generated a 10 MB file. At that rate your desired 500 million would take ... hmm, around 25 days and generate a 50 GB file.
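The extrapolation is plain linear scaling of the test figures quoted above. A quick check in Python, assuming the throughput stays constant as the job grows:

```python
# Figures from the test run described above.
regions_tested = 100_000
minutes_tested = 7
megabytes_tested = 10
regions_wanted = 500_000_000

scale = regions_wanted / regions_tested      # how many times larger the real job is
days = minutes_tested * scale / 60 / 24      # minutes -> hours -> days
gigabytes = megabytes_tested * scale / 1000  # MB -> GB

print(f"{scale:.0f}x the test: about {days:.1f} days and {gigabytes:.0f} GB")
```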

If I do the 100,000 coordinates 100 at a time (i.e., 100 regions per 'samtools faidx' command line, so samtools starts only 1,000 times rather than 100,000 times), then the time remains the same. So I suspect that the delay is either samtools' own efficiency or simply having to read the largish human genome.
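For reference, batching the regions could be sketched like this in Python. The helper `faidx_commands` and the file name `hg19.fa` are made up for illustration; the only thing taken from samtools itself is that 'samtools faidx' accepts several `chrom:start-end` regions after the FASTA name.

```python
from itertools import islice

def faidx_commands(fasta, regions, batch_size=100):
    """Yield one 'samtools faidx' argument list per batch of regions.

    regions: iterable of strings such as "chr1:1000-1050".
    """
    it = iter(regions)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield ["samtools", "faidx", fasta, *batch]

# Toy input: 250 regions become three command lines (100 + 100 + 50 regions).
regions = [f"chr1:{i}-{i + 49}" for i in range(1, 251)]
commands = list(faidx_commands("hg19.fa", regions, batch_size=100))
print(len(commands), commands[0][:5])
```

Each argument list can then be handed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`. As noted above, this did not change the total time in the test, so it saves process start-ups rather than lookup time.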

You might be able to cut that time down if you first separated your input by chromosome and then ran per-chromosome 'samtools faidx' jobs in parallel, assuming that your disks are fast enough. You could also use a RAM disk to hold the genome file and the *.fai file.
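Splitting the input by chromosome is the easy part. A small sketch, assuming the coordinates are available as `(chrom, start, end)` tuples (the function name is hypothetical):

```python
from collections import defaultdict

def group_by_chromosome(coords):
    """Group (chrom, start, end) tuples by chromosome, sorted by position."""
    groups = defaultdict(list)
    for chrom, start, end in coords:
        groups[chrom].append((start, end))
    for intervals in groups.values():
        intervals.sort()  # sorted lookups walk the file in one direction
    return dict(groups)

coords = [("chr2", 500, 550), ("chr1", 900, 960), ("chr1", 100, 150)]
print(group_by_chromosome(coords))
```

Each group could then go to its own worker, for example through `concurrent.futures`. Whether that actually helps depends on disk speed, as said above.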

I doubt that doing a web fetch on 500 million fragments will be fast, nor will it make you popular with UCSC.

jm's solution could also be feasible. No matter how you slice it, fetching 500 million of anything takes time.

**westerman** · 04-15-2014, 10:59 AM

swbarnes2's comment (use the MD line) was also something I was thinking about. But it takes more programming than using samtools directly.

**vivek_** · 04-15-2014, 11:08 AM

System calls from within a scripting language will always add a lot of latency. Since you seem interested in using Python, I came across this module yesterday, which might be applicable here:

GitHub - mdshw5/pyfaidx: Efficient pythonic random access to fasta subsequences

https://github.com/mdshw5/pyfaidx


Samtools provides a function "faidx" (FAsta InDeX), which creates a small flat index file ".fai" allowing for fast random access to any subsequence in the indexed fasta, while loading a minimal amount of the file into memory.

Pyfaidx provides an interface for creating and using this index for fast random access of DNA subsequences from huge fasta files in a "pythonic" manner. Indexing speed is comparable to samtools, and in some cases sequence retrieval is much faster (benchmark).
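To show why a ".fai" index makes random access cheap, here is a small standard-library sketch. It is not how pyfaidx or samtools is implemented; it only relies on the five tab-separated columns of a ".fai" line (name, length, byte offset of the first base, bases per line, bytes per line), and the function names are made up for illustration.

```python
def read_fai(fai_path):
    """Parse a .fai file into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as handle:
        for line in handle:
            name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
            index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index

def byte_position(offset, linebases, linewidth, pos0):
    """File position of the base at 0-based sequence position pos0."""
    return offset + (pos0 // linebases) * linewidth + pos0 % linebases

def fetch_indexed(fasta_path, index, chrom, start, end):
    """Return bases start..end (1-based, inclusive) with a single seek and read."""
    length, offset, linebases, linewidth = index[chrom]
    end = min(end, length)
    if start < 1 or start > end:
        return ""
    first = byte_position(offset, linebases, linewidth, start - 1)
    last = byte_position(offset, linebases, linewidth, end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    # Drop the line breaks that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

Only the requested bytes are read, which is why memory use stays small. For many lookups you would keep the file handle open instead of reopening it on every call.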

**dpryan** · 04-15-2014, 11:22 AM

If you need it to be quick then just read the genome into memory once and then iterate over things. That's how those of us who have written methylation callers (where you have to do this exact process) do things, since it gives the best performance.

Anyway, as the others said, it's easier to just parse the MD string if it exists and is valid.
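A bare-bones Python sketch of the read-it-into-memory approach, assuming a plain uncompressed FASTA file and enough RAM to hold it (for the human genome that means several GB). The function names are hypothetical.

```python
def load_fasta(path):
    """Read a whole FASTA file into a dict of {sequence name: sequence string}."""
    genome = {}
    name = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    genome[name] = "".join(chunks)
                name = line[1:].split()[0]  # keep the header up to the first space
                chunks = []
            elif line:
                chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome

def fetch_in_memory(genome, chrom, start, end):
    """Return bases start..end (1-based, inclusive) by slicing the string."""
    return genome[chrom][start - 1:end]
```

After the one-time load, every lookup is a dictionary access plus a string slice, so iterating over hundreds of millions of coordinates never goes back to the disk.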

**swbarnes2** · 04-15-2014, 12:29 PM

Originally posted by westerman:

swbarnes2's comment (use the MD line) was also something I was thinking about. But it takes more programming than using samtools directly.

In the 25 days it would take to use samtools faidx hundreds of millions of times, you could learn how to program well enough to parse the SAM entry.

**xiangwulu** · 04-15-2014, 01:53 PM

Originally posted by swbarnes2:

Doesn't the MD line tell you how the read differs from the genome? Wouldn't it be easier to use that and the sequence from the SAM line, so you don't have to look at another file at all?

The MD tag in SAM files tells you something, but not enough. Also, some tools, such as BLAT and BFAST, do not output the MD tag in their SAM files.

**mdshw5** · 04-22-2015, 05:57 AM

Originally posted by vivek_:

System calls from within a scripting language will always add a lot of latency. Since you seem interested in using Python, I came across this module yesterday, which might be applicable here:

https://github.com/mdshw5/pyfaidx

This is exactly right. System calls from a scripting language will be slow, and pyfaidx is quite a bit faster than calls to samtools or pysam (see Table 1 in my pre-print for pyfaidx). However, dpryan's suggestion of using raw strings (or something close, like Biopython's SeqIO) will always be the fastest option, at the expense of keeping track of your own substring indexes. If "fast enough" is okay, I would suggest using pyfaidx, as it's extensively tested for correctness and has a reasonable API for working efficiently with your existing indexed fasta files.

However, I may be biased.

a fast way to get human genome sequence by coordinate
