Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • a fast way to get human genome sequence by coordinate

    Hi all,

    I want to get a lot human genome fragments (more than 500 million of them) randomly.


    This is a partial work of the whole process. I have .sam result file from bowtie, with 10 million human genome reads alignment. I want to compare each query reads with the 'reference sequence it aligned to' from the sam file. The reference sequence I used is hg19.fa from UCSC. So I need to be able to get the sequence from hg19.fa (or chromosome files) by using the location in the sam file.


    e.g. with giving: chr4:35654-35695, i could get 42bp sequences:
    gtcttccagggtttttatatttttgggttttacacttaagt

    so far, i had 2 solutions:
    1. python script to fetch sequences from UCSC DAS server:
    http://genome.ucsc.edu/cgi-bin/das/h...r4:35654,35695

    2. using python script call ''samtools faidx'' command and return commnad output,
    from post:
    http://seqanswers.com/forums/showthr...ome+coordinate

    but, they are slow. samtools faidx is bit faster than getting it from DAS server, but still slow.

    so, is there any FAST way to do this? i have the seprate chromosome fasta files, and hg19.fa file.
    Last edited by xiangwulu; 04-15-2014, 08:26 AM.

  • #2
    Doesn't the MD line tell you how the read differs from the genome? Wouldn't it be easier to use that and the sequence from the SAM line, so you don't have to look at another file at all?

    Comment


    • #3
      How fast do you want it to be? 'samtools faidx', in my experience, is rather speedy but you are asking for a lot of data. As a test, using bash I tried fetching 100,000 genome coordinates (randomly generated asking for lengths from 50 to 100 bases on all chromosomes) via 'samtools faidx'. It took about 7 minutes wall-clock time and generated a 10 MB file. At that rate your desired 500 million would take ... hum, around 25 days and generate a 50GB file.

      If I do the 100,000 coordinates 100 at a time (i.e., 100 regions on the 'samtools faidx' command line thus avoiding having to start up samtools 100,000 times as opposed to only 1000 times) then the time remains the same. So I suspect that the delay is either samtool's efficiency or simply having to read the largish human genome.

      You might be able to cut that time down if you first separated your input into chromosomes and then did per-chromosome 'samtools faidx' in parallel -- assuming that your disks are fast enough. And/or you can use a RAM-disk to hold the genome file and the *.fai file.

      I doubt if doing a web fetch on 500 million fragments will be fast nor will make you popular with UCSC.

      jm's solution could also be feasible. No matter how you slice it taking 500 million of anything takes time.

      Comment


      • #4
        swbarne's comment (use the MD line) was also something I was thinking about. But it takes more programming than using samtools directly.

        Comment


        • #5
          System calls from within a scripting language will always add a lot of latency. Since you seem interested in using Python, I came across this module yesterday which might be applicable here



          Samtools provides a function "faidx" (FAsta InDeX), which creates a small flat index file ".fai" allowing for fast random access to any subsequence in the indexed fasta, while loading a minimal amount of the file in to memory.

          Pyfaidx provides an interface for creating and using this index for fast random access of DNA subsequences from huge fasta files in a "pythonic" manner. Indexing speed is comparable to samtools, and in some cases sequence retrieval is much faster (benchmark).

          Comment


          • #6
            If you need it to be quick then just read the genome into memory once and then iterate over things. That's how those of us who have written methylation callers (where you have to do this exact process) do things, since it gives the best performance.

            Anyway, as the others said, it's easier to just parse the MD string if it exists and is valid.

            Comment


            • #7
              Originally posted by westerman View Post
              swbarne's comment (use the MD line) was also something I was thinking about. But it takes more programming than using samtools directly.
              In the 25 days it would take to use samtools faidx hundreds of millions of times, you could learn how to program well enough to parse the SAM entry.

              Comment


              • #8
                Originally posted by swbarnes2 View Post
                Doesn't the MD line tell you how the read differs from the genome? Wouldn't it be easier to use that and the sequence from the SAM line, so you don't have to look at another file at all?
                The MD tag in sam files tell something, but not enough. Also, some tools does not output MD tag in sam file, like blat, bfast.

                Comment


                • #9
                  Originally posted by vivek_ View Post
                  System calls from within a scripting language will always add a lot of latency. Since you seem interested in using Python, I came across this module yesterday which might be applicable here

                  https://github.com/mdshw5/pyfaidx
                  This is exactly right. System calls from a scripting language will be slow, and pyfaidx is quite a bit better than calls to samtools or pysam (see Table 1 in my pre-print for pyfaidx). However, dpryan's suggestion of using raw strings (or something close, like Biopython's Seq.IO) will always be the fastest option, at the expense of keeping track of your own substring indexes. If "fast enough" is okay, I would suggest using pyfaidx as it's extensively tested for correctness and has a reasonable API for working with your existing indexed fasta files efficiently.

                  However, I may be biased.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 06:37 PM
                  0 responses
                  8 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 06:07 PM
                  0 responses
                  8 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  49 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  66 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X