  • extract full fasta file for local blast hits

    Hi all,

    I would like to take local blast hit sequences, i.e. hsp_hseq in blast xml
    format, and extract the full sequences of those hits from the original fasta
    file (since the xml output only provides the HSPs), then put them in a fasta file of all hits that I can use later.

    This should be a widely performed task but I can't find any scripts that
    directly address this issue. I have not acquired the skills to make my own
    scripts for this task.

    Thanks in advance

  • #2
    Oyster,

    I think this thread (http://seqanswers.com/forums/showthread.php?t=8805) covers your question. Basically, you construct a list of accessions or IDs (in your case, for your hits of interest) and use that list to extract the full sequences from the BLAST database with the accompanying BLAST command line tools.
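
    For example, with the BLAST+ tools something along these lines should do it, assuming the database was built with -parse_seqids (the database and file names here are just placeholders):
    Code:
    # myDB, hit_ids.txt and hits.fasta are placeholder names; hit_ids.txt holds one ID per line
    blastdbcmd -db myDB -entry_batch hit_ids.txt -out hits.fasta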



    • #3
      Thanks kmcarr. Indeed, that post was exactly what I needed, but I still have a couple of questions.

      'makeblastdb' with the '-parse_seqids' option seems like the way to go for this task.
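
      Presumably that means something along these lines (the input file and database names below are just placeholders on my end):
      Code:
      # placeholder names for one lane of reads and the resulting database
      makeblastdb -in lane1_reads.fa -dbtype nucl -parse_seqids -out myDB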

      However, I have 12 lanes of Illumina data. Running makeblastdb on one lane had already taken 3 days when I had to stop it because I needed the RAM for other work.

      So I thought I'd go about it a different way, using the list of hits from blastn:

      Code:
      blastn -db myDB -query myQuery -out myContigList.txt -outfmt "6 sallacc"

      Now I have a text file with all my hits; the problem is finding those hits in the original fasta file. I tried a bunch of scripts to do this, like the ones I've attached, but they have all failed.

      I consistently get a "Segmentation fault" after the index grows to some 545 GB and fills all of my hard drive space. Obviously something is going wrong with the indexing.

      The fasta file is straightforward:
      >1
      CTCAAATTATAGTAATATCACGATGCAGTTATATGATATTACATGTGATACGTAAAAAAAATTATTATTTGTGCTATAATCATCTTGTCATGTGAACATGGAATGAG
      >2
      TCTTTGCAAAATCTGTGTTAAATTGACAGAATAGTTTGTATTGTTCAACCAGGTAAAAATATATCACTCATTTTAATTCACTAAAAGTCACAAAATTTACTTTTCAT
      >3
      GACTTTTTGTTAACGTATTTAGCCACATAGAAACCAACAGCCATATAACTGGTAGCTTTAAGCGGCTCACCTTTAGCATCAACAGGCCACAACCAACCAGAACGTGA

      ...and it continues like that up to 35 million reads or so.

      (The wrapping looks odd here, but that is just the text box.) I reformatted the fasta file using sreformat, since I saw it mentioned elsewhere that this may help with indexing, but still no luck.

      Does anyone have a strategy for indexing large fasta files (~3.5 GB) composed of some 35 million short reads? Once it is indexed properly, I assume it will be no problem to extract the sequences of interest, which number in the thousands for me.

      Or maybe I just have to stop cutting corners and let "makeblastdb" run for a few more days. Any suggestions?
      Attached Files



      • #4
        You've got their numbers ...

        If they are ordered like that, then you know exactly where in the file
        they are; just go and get them from the command line.

        Maybe collapse each sequence onto the same line as its defline to avoid
        any arithmetic.
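
        For instance, a minimal sketch assuming each record really is exactly two lines and the numbering starts at 1: record n then sits at lines 2n-1 and 2n, so a single record can be pulled out with
        Code:
        # n is a placeholder for one of your hit numbers; bigseqfile.fa is your fasta
        n=1234; sed -n "$((2*n-1)),$((2*n))p" bigseqfile.fa
        Appending ";$((2*n))q" to the sed expression makes it quit right after that record instead of scanning the rest of the file.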



        • #5
          Yes, I do have their numbers.

          I have not succeeded in grabbing the 3000+ sequences from my 35+ million sequences.

          If you know a way, please do share.

          Thanks



          • #6
            Originally posted by Oyster
            Does anyone have a strategy for indexing large fasta files (~3.5 GB) composed of some 35 million short reads? Once it is indexed properly, I assume it will be no problem to extract the sequences of interest, which number in the thousands for me.

            Oyster,

            I'm going to suggest that you may need to find an alternate approach to your problem. Creating and searching against a BLAST database of hundreds of millions of short reads is impractical, and the output would be pretty uninformative anyway. The short reads have no intrinsic information associated with them, so what could you learn from this other than how many short reads are an approximate match to whatever query sequence you submit?

            Perhaps you could describe what you are trying to accomplish and we could see if there is a more efficient approach.



            • #7
              You somehow know the identity of 3000+ sequences of interest
              in your 3.5 GB fasta file, and you would like to isolate them.

              (More background information would help.)

              You do not seem to be able to index the fasta file
              because of some limitation; this is not encouraging.

              (Information about the system you are using would help.)

              You indicate that the sequences in the fasta file have simple integers
              for their deflines and that they are short reads.
              I am going to make a leap here and assume they continue to be ordered
              beyond the first three you show, and that each short read
              sits entirely on the single line right after its defline.

              Furthermore, I assume you are able to get your list of sequences of interest
              into sorted order, one per line. Maybe:
              Code:
               sort -n random.list > sorted.list
              Notice that each fasta record n then sits at lines 2n-1 and 2n.
              You could use your favorite scripting language to open the
              fasta file, seek directly to each record, and output it
              without any searching.

              Or use a command line tool like awk and change what you are
              searching for as you go.
              *Note: this relies on both files being ordered.*

              Code:
              awk 'BEGIN{ n=0; while ((getline line < "sorted.list") > 0) a[n++]=line; i=0 }
                   i<n && $1==(">" a[i]) { i++; print; getline; print }' bigseqfile.fa
              But if you are having problems indexing the fasta file
              and they are resource related, you are likely going to
              continue to have problems until you address that.



              • #8
                Hello-

                (I posted this in another forum as well, but it is a similar question.)
                I'm trying to extract nucleotide sequences from my results after running a local blastn against my local database. The output right now is a standard "blast-looking" result page, but it's not easy to work with the results when I want to compare sequences further. (I have a gene of interest, and will have more genes of interest; I want to see how they compare to my local database of sequences, but then I want the results in FASTA format for further analyses and comparisons.) The help manual for local blast doesn't seem to have the answers I am looking for.

                Will gladly provide additional information if more is needed. I really thank anyone who is able to help!



                • #9
                  Sequence of blast results

                  Let's try saying it this way.

                  Don't extract sequences from your blast results.

                  Extract a list of sequence identifiers (the top hit, perhaps) from your blast results.
                  Then use that list of identifiers to extract the fasta sequences from the blast _database_ your blast results came from.

                  That is:
                  There are command line tools that, given an identifier (or a list of identifiers), will return the fasta sequences that went into making your "local blast database".
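
                  For instance, a rough sketch of that two-step idea (the database, query, and file names are placeholders, and it assumes the database was built with -parse_seqids): pull the subject IDs out of tabular blastn output, then hand the list to blastdbcmd:
                  Code:
                  # placeholders: myDB, genes.fa, hit_ids.txt, hits.fasta
                  blastn -db myDB -query genes.fa -outfmt "6 qseqid sseqid" | cut -f2 | sort -u > hit_ids.txt
                  blastdbcmd -db myDB -entry_batch hit_ids.txt -out hits.fasta
                  The sort -u step just removes duplicate identifiers before handing the list to blastdbcmd.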

