SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
question about genome browser coordinate kaixinsjtu Bioinformatics 0 07-29-2011 09:10 PM
Statistical geneticist, human whole genome sequence analysis knome Industry Jobs! 0 05-05-2011 09:39 AM
1000 genome data and other human ref sequence differ johnadam33 Bioinformatics 4 01-05-2011 02:16 AM
Download promoter sequence db for human genome? ewilbanks Bioinformatics 3 10-27-2009 09:53 AM
Sequence and structural variation in a human genome uncovered by short-read, massivel benimmyeo Literature Watch 0 06-25-2009 02:14 AM

Reply
 
Thread Tools
Old 04-15-2014, 08:04 AM   #1
xiangwulu
Member
 
Location: ireland

Join Date: Apr 2014
Posts: 18
Default a fast way to get human genome sequence by coordinate

Hi all,

I want to get a lot human genome fragments (more than 500 million of them) randomly.


This is a partial work of the whole process. I have .sam result file from bowtie, with 10 million human genome reads alignment. I want to compare each query reads with the 'reference sequence it aligned to' from the sam file. The reference sequence I used is hg19.fa from UCSC. So I need to be able to get the sequence from hg19.fa (or chromosome files) by using the location in the sam file.


e.g. with giving: chr4:35654-35695, i could get 42bp sequences:
gtcttccagggtttttatatttttgggttttacacttaagt

so far, i had 2 solutions:
1. python script to fetch sequences from UCSC DAS server:
http://genome.ucsc.edu/cgi-bin/das/h...r4:35654,35695

2. using python script call ''samtools faidx'' command and return commnad output,
from post:
http://seqanswers.com/forums/showthr...ome+coordinate

but, they are slow. samtools faidx is bit faster than getting it from DAS server, but still slow.

so, is there any FAST way to do this? i have the seprate chromosome fasta files, and hg19.fa file.

Last edited by xiangwulu; 04-15-2014 at 08:26 AM.
xiangwulu is offline   Reply With Quote
Old 04-15-2014, 10:40 AM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Doesn't the MD line tell you how the read differs from the genome? Wouldn't it be easier to use that and the sequence from the SAM line, so you don't have to look at another file at all?
swbarnes2 is offline   Reply With Quote
Old 04-15-2014, 10:57 AM   #3
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

How fast do you want it to be? 'samtools faidx', in my experience, is rather speedy but you are asking for a lot of data. As a test, using bash I tried fetching 100,000 genome coordinates (randomly generated asking for lengths from 50 to 100 bases on all chromosomes) via 'samtools faidx'. It took about 7 minutes wall-clock time and generated a 10 MB file. At that rate your desired 500 million would take ... hum, around 25 days and generate a 50GB file.

If I do the 100,000 coordinates 100 at a time (i.e., 100 regions on the 'samtools faidx' command line thus avoiding having to start up samtools 100,000 times as opposed to only 1000 times) then the time remains the same. So I suspect that the delay is either samtool's efficiency or simply having to read the largish human genome.

You might be able to cut that time down if you first separated your input into chromosomes and then did per-chromosome 'samtools faidx' in parallel -- assuming that your disks are fast enough. And/or you can use a RAM-disk to hold the genome file and the *.fai file.

I doubt if doing a web fetch on 500 million fragments will be fast nor will make you popular with UCSC.

jm's solution could also be feasible. No matter how you slice it taking 500 million of anything takes time.
westerman is offline   Reply With Quote
Old 04-15-2014, 10:59 AM   #4
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

swbarne's comment (use the MD line) was also something I was thinking about. But it takes more programming than using samtools directly.
westerman is offline   Reply With Quote
Old 04-15-2014, 11:08 AM   #5
vivek_
PhD Student
 
Location: Denmark

Join Date: Jul 2012
Posts: 164
Default

System calls from within a scripting language will always add a lot of latency. Since you seem interested in using Python, I came across this module yesterday which might be applicable here

https://github.com/mdshw5/pyfaidx

Quote:
Samtools provides a function "faidx" (FAsta InDeX), which creates a small flat index file ".fai" allowing for fast random access to any subsequence in the indexed fasta, while loading a minimal amount of the file in to memory.

Pyfaidx provides an interface for creating and using this index for fast random access of DNA subsequences from huge fasta files in a "pythonic" manner. Indexing speed is comparable to samtools, and in some cases sequence retrieval is much faster (benchmark).
vivek_ is offline   Reply With Quote
Old 04-15-2014, 11:22 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

If you need it to be quick then just read the genome into memory once and then iterate over things. That's how those of us who have written methylation callers (where you have to do this exact process) do things, since it gives the best performance.

Anyway, as the others said, it's easier to just parse the MD string if it exists and is valid.
dpryan is offline   Reply With Quote
Old 04-15-2014, 12:29 PM   #7
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by westerman View Post
swbarne's comment (use the MD line) was also something I was thinking about. But it takes more programming than using samtools directly.
In the 25 days it would take to use samtools faidx hundreds of millions of times, you could learn how to program well enough to parse the SAM entry.
swbarnes2 is offline   Reply With Quote
Old 04-15-2014, 01:53 PM   #8
xiangwulu
Member
 
Location: ireland

Join Date: Apr 2014
Posts: 18
Default

Quote:
Originally Posted by swbarnes2 View Post
Doesn't the MD line tell you how the read differs from the genome? Wouldn't it be easier to use that and the sequence from the SAM line, so you don't have to look at another file at all?
The MD tag in sam files tell something, but not enough. Also, some tools does not output MD tag in sam file, like blat, bfast.
xiangwulu is offline   Reply With Quote
Old 04-22-2015, 05:57 AM   #9
mdshw5
Junior Member
 
Location: Baltimore, MD

Join Date: Dec 2011
Posts: 1
Thumbs up

Quote:
Originally Posted by vivek_ View Post
System calls from within a scripting language will always add a lot of latency. Since you seem interested in using Python, I came across this module yesterday which might be applicable here

https://github.com/mdshw5/pyfaidx
This is exactly right. System calls from a scripting language will be slow, and pyfaidx is quite a bit better than calls to samtools or pysam (see Table 1 in my pre-print for pyfaidx). However, dpryan's suggestion of using raw strings (or something close, like Biopython's Seq.IO) will always be the fastest option, at the expense of keeping track of your own substring indexes. If "fast enough" is okay, I would suggest using pyfaidx as it's extensively tested for correctness and has a reasonable API for working with your existing indexed fasta files efficiently.

However, I may be biased.
mdshw5 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:24 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO