Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • extract subsequence from genomic fasta file

    Hi,

    I am new to this forum. The reason I joined is to ask the following:

    How do I extract a sub-sequence from a fasta file containing the human genome?

    TIRG's cdbyank was good for pulling whole sequences out of a multifasta file, but that's now what I want. I want to extract something like the following:

    "chr6:26795842-26792619"

    ie given chromosome number, start and end points, get the DNA sequence.

    I have BWA indexed fasta files files, samtools indexed fasta files, and another. Can this be done with BWA, or with samtools?

    Cheers,
    Joe White
    Mass. Eye and Ear Infirmary

  • #2
    EMBOSS has tools available to do this

    Comment


    • #3
      samtools can do this too

      Comment


      • #4
        ncbi blast has start & end parameters to retrieve subseuences

        Comment


        • #5
          Originally posted by genericforms View Post
          samtools can do this too
          Which tool?

          jwhite

          Comment


          • #6
            UCSC, of course, has a tool for occasional web based use:



            Hack the url to hg19 or whatever. You can get to it from the link at the top of a browser page.

            If you need something really fast for calling from the command line, compile this file (fetchdna.c"). (I know, "it's not using a nib file" or the like, I got one! ... but this is simpler).
            Code:
            /*
            compile : gcc -Wall -O2 -o fetchdna fetchdna.c
            
            fetchdna - gets dna from canonical genomic fastas with line len = 50
            Usage : fetchdna range directory_path_to_canonical_genomic_fastas\n");
            Example : ./fetchdna chr17:15000-16000 /h1/finneyr/amd64/hg18/ \n");
            
            Instead of typing the location of your path to the genomic fastas, make a bash file like this ...
            
            echo "~/bin/fetchdna $1 /h1/finneyr/amd64/hg18/" > myfetch
            chmod +x myfetch
            use like this :  ./myfetch chr17:15000-16000
            */
            
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            
            #define MAXBUFF 5000
            #define MAX_FETCH_SIZE 1000002
            char path[MAXBUFF];
            char dna[MAX_FETCH_SIZE ];
            
            static void comma_gin(char *s) // gets rid of commas in a string  - dang commanists !!!
            {
                char *z;
                char tmps[MAXBUFF+10];
                while (1)
                {
                    z = strstr(s,",");
                    if (!z) break;
                    strcpy(tmps,z+1);
                    strcpy(z,tmps);
                }
                return;
            }
            
            static int parse_position(const char argposition[],char chr[],int *start,int *end)
            {
                int i;
                char tmps[1024];
                char t[MAXBUFF];
                char tmps1[MAXBUFF];
                char tmps2[MAXBUFF];
            
                tmps[0] = tmps1[0] = tmps2[0] = t[0] = (char)0;
                strcpy(t,argposition);
                   // un_escape(t);        for when parsing via a URL which often put escape codes in
                strcpy(tmps,t);
                for (i=0 ; tmps[i] ; i++) { if (tmps[i] == ':') { tmps[i] = ' '; } if (tmps[i] == '-') { tmps[i] = ' '; } }
                sscanf(tmps,"%s %s %s",chr,tmps1,tmps2);
                if (strcmp(chr,"chr23") == 0) strcpy(chr,"chrX");
                else if (strcmp(chr,"chr24") == 0) strcpy(chr,"chrY");
                else if (strcmp(chr,"chr25") == 0) strcpy(chr,"chrM");
                comma_gin(tmps1);
                *start = atoi(tmps1);
                comma_gin(tmps2);
                *end = atoi(tmps2);
                return 0;
            }
            
            void fetch(char fn[],long int pos,char dna[],int len)
            {
                char s[MAXBUFF];
                register int i;
                register int k;
                int headlen;
                long spot;
                FILE *fp;
            
                           // printf("fetch : %s %ld %d\n",fn,pos,len);
                dna[0] = (char)0;
                fp = fopen(fn,"r");
                if (fp == (void *)0) { fprintf(stderr,"%s not open error.  Invalid filename. \n",fn); exit(0); }
                fgets(s,1022,fp);  // eat head line
                headlen = strlen(s) ;
                spot = headlen +  pos + (pos/50);
                           // printf("spot = %ld headlen = %d pos = %ld  len=%d\n",spot,headlen,pos,len);
                fseek(fp,spot,SEEK_SET);
                i  = 0;
                while (i < len)
                {
                    k = fgetc(fp);
                    if ((char)k == '\r') continue;
                    if ((char)k == '\n') continue;
                    dna[i++] = (char)k;
                }
                dna[i+1] = (char)0;
                fclose(fp);
                return;
            }
            
            
            int fetchwrap(char *chr,int s, int e)
            {
                int k;
                char fn[512]; //  file name to canonical fasta for a chromsome
                if ((e-s)>MAX_FETCH_SIZE-2)
                {
                    fprintf(stderr,"ERROR- too big %s %d %d , cant be bigger than %d \n",chr,s,e,MAX_FETCH_SIZE);
                    return -1;
                }
                sprintf(fn,"%s/%s.fa",path,chr);
                dna[0] = (char)0;
                fetch(fn,s,dna,e-s);
                for (k=0 ; dna[k] ; k++)
                {
                    printf("%c",dna[k]);
                    if ((k%50) == 49) printf("\n");
                }
                printf("\n");
                return 0;
            }
            
            int main(int argc,char *argv[])
            {
                char position[MAXBUFF];
                char chr[MAXBUFF];
                int start,end;
            
                if (argc != 3)
                {
                    fprintf(stderr,"ERROR: usage example is ./fetchdna chr17:15000-16000 /h1/finneyr/amd64/hg18/ \n");
                    fprintf(stderr,"Usage is fetchdna range directory_path_to_canonical_genomic_fastas\n");
                    fprintf(stderr,"Note: start of chromosome is 1, not 0\n");
                    return 1;
                }
            
                strcpy(position,argv[1]);      // buffer overflow , careful if webizing
                strcpy(path,argv[2]);
                parse_position(position,chr,&start,&end); // eg.: "position=chrX:37301314-37347604"
                start = start - 1;
                end = end ;
                fetchwrap(chr,start,end);
                return 0;
            }
            Last edited by Richard Finney; 06-28-2012, 09:00 AM. Reason: spelling

            Comment


            • #7
              Just to add another to the list, bedtools getfasta will do this, and it is very easy to integrate into a variety of scripts.

              Comment


              • #8
                Thanks for your answers folks.

                The c program and remarks about samtools, bedtools were helpful..

                Just by chance I noticed in the samtools docs that faidx not only indexes sequences, but will retrieve sequences if given a region (eg chr2:1234-1266).
                This works very nicely:

                samtools faidx fasta.fa region

                This can be scripted easily.

                Just wanted to pass this on for future reference.

                Cheers,
                Joe

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                66 views
                0 likes
                Last Post seqadmin  
                Working...
                X