Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Oyster
    Junior Member
    • Nov 2010
    • 4

    extract full fasta file for local blast hits

    Hi all,

    I would like to take local blast hit sequences, i.e. hsp_hseq in blast xml
    format, and extract the full sequences of those hits from the original fasta
    file (as the xml output only provides those hsp's) and put them in a fasta file of all hits that I can use later.

    This should be a widely performed task but I can't find any scripts that
    directly address this issue. I have not acquired the skills to make my own
    scripts for this task.

    Thanks in advance
  • kmcarr
    Senior Member
    • May 2008
    • 1181

    #2
    Oyster,

    This thread (http://seqanswers.com/forums/showthread.php?t=8805) I think covers your question. Basically you construct a list of accessions or ids (in your case for your hits of interest) and use that list to extract the full sequences from the BLAST database using the accompanying BLAST command line tools.

    Comment

    • Oyster
      Junior Member
      • Nov 2010
      • 4

      #3
      Thanks kmcarr. Indeed, that post was exactly what I needed, yet I remain with a couple of questions.

      'makeblastdb' with the '-parse_seqids' option seems like way to go to do this task.

      However, I have 12 lanes of illumina data. Trying makeblastdb on one lane took me 3 days and I had to stop this because I need the RAM for some other work.

      So I thought I'd go about it in a different way using the list of hits from blastn

      blastn -db myDB -query myQuery -out myContigList.txt -outfmt "6 sallacc"

      I have a text file with all my hits. finding these hits in the original fasta file. I tried out a bunch of scripts to do this, like those I've attached, but they all have failed.

      I consistently get a "Segmentation fault" after the index builds to some 545 GB, taking up all of my hard drive's memory. Obviously there is something funny going on with the indexing.

      The fasta file is straightforward:
      >1
      CTCAAATTATAGTAATATCACGATGCAGTTATATGATATTACATGTGATACGTAAAAAAAATTATTATTTGTGCTATAATCATCTTGTCATGTGAACATGGAATGAG
      >2
      TCTTTGCAAAATCTGTGTTAAATTGACAGAATAGTTTGTATTGTTCAACCAGGTAAAAATATATCACTCATTTTAATTCACTAAAAGTCACAAAATTTACTTTTCAT
      >3
      GACTTTTTGTTAACGTATTTAGCCACATAGAAACCAACAGCCATATAACTGGTAGCTTTAAGCGGCTCACCTTTAGCATCAACAGGCCACAACCAACCAGAACGTGA

      yet it runs to 35 million or so

      (The wrapping looks weird here but that is just due to the text box's properties.) I reformatted the fasta file using sreformat as I saw it mentioned elsewhere that this may help with indexing- still no luck.

      Does anyone have an strategy to index large fasta files (~3.5GB) composed of some 35 million short reads? Once indexed properly, I assume it will be no problem to extract the sequences of interests, which number in the thousands for me.

      Or maybe I should just have to stop cutting corners and let "makeblastdb" run for a few more days. Any suggestions?
      Attached Files

      Comment

      • tomc
        Member
        • Feb 2011
        • 29

        #4
        you got their number ...

        if they are ordered like that then you know where in the file they are,
        just go and get them from the command line

        maybe collapse the sequence onto the same line as the defline to avoid
        any arithmetic

        Comment

        • Oyster
          Junior Member
          • Nov 2010
          • 4

          #5
          yes i do have their number

          I have not succeeded in grabbing the +3000 sequences from my +35 million sequences.

          If you know a way, please do share.

          Thanks

          Comment

          • kmcarr
            Senior Member
            • May 2008
            • 1181

            #6
            Originally posted by Oyster View Post
            Thanks kmcarr. Indeed, that post was exactly what I needed, yet I remain with a couple of questions.

            'makeblastdb' with the '-parse_seqids' option seems like way to go to do this task.

            However, I have 12 lanes of illumina data. Trying makeblastdb on one lane took me 3 days and I had to stop this because I need the RAM for some other work.

            So I thought I'd go about it in a different way using the list of hits from blastn

            blastn -db myDB -query myQuery -out myContigList.txt -outfmt "6 sallacc"

            I have a text file with all my hits. finding these hits in the original fasta file. I tried out a bunch of scripts to do this, like those I've attached, but they all have failed.

            I consistently get a "Segmentation fault" after the index builds to some 545 GB, taking up all of my hard drive's memory. Obviously there is something funny going on with the indexing.

            The fasta file is straightforward:
            >1
            CTCAAATTATAGTAATATCACGATGCAGTTATATGATATTACATGTGATACGTAAAAAAAATTATTATTTGTGCTATAATCATCTTGTCATGTGAACATGGAATGAG
            >2
            TCTTTGCAAAATCTGTGTTAAATTGACAGAATAGTTTGTATTGTTCAACCAGGTAAAAATATATCACTCATTTTAATTCACTAAAAGTCACAAAATTTACTTTTCAT
            >3
            GACTTTTTGTTAACGTATTTAGCCACATAGAAACCAACAGCCATATAACTGGTAGCTTTAAGCGGCTCACCTTTAGCATCAACAGGCCACAACCAACCAGAACGTGA

            yet it runs to 35 million or so

            (The wrapping looks weird here but that is just due to the text box's properties.) I reformatted the fasta file using sreformat as I saw it mentioned elsewhere that this may help with indexing- still no luck.

            Does anyone have an strategy to index large fasta files (~3.5GB) composed of some 35 million short reads? Once indexed properly, I assume it will be no problem to extract the sequences of interests, which number in the thousands for me.

            Or maybe I should just have to stop cutting corners and let "makeblastdb" run for a few more days. Any suggestions?
            Oyster,

            I'm going to suggest that you may need to find an alternate approach to your problem. Creating and searching against a BLAST database of hundreds of millions of short reads is an impossible task, not to mention the output would be pretty uninformative. The short reads have no intrinsic information associated with them. What could you learn from this other than how many short reads are an approximate match to whatever query sequence you submit.

            Perhaps you could describe what you are trying to accomplish and we could see if there is a more efficient approach.

            Comment

            • tomc
              Member
              • Feb 2011
              • 29

              #7
              You somehow know the identity of 3000+ sequences if interest
              in your 3.5G fasta file and you would like to isolate them.

              (more background information would help.)

              you do not seem to be able to index the fasta file
              because of some limitation. this is not encouraging.

              (information of the system you are using would help)

              You indicate the sequences in the fasta file have simple integers
              for their defline and that they are short reads.
              I am going to make a leap here and assume they continue to be ordered
              beyond the first three you show. and that each short read
              is all on its own line. (basically the defline)

              furthermore, that you are able to get your list of sequences of interest
              into sorted order one per line. maybe,
              Code:
               sort -n random.list > sorted.list
              notice that each fasta record n is at lines 2n-1 and 2*n
              you could use your favorite scripting language to open the
              fasta file and carefully seek to each record and output it
              without any searching.

              or use a command line tool like awk and change what you are
              searching for as you go
              *note this relies on both files being ordered*

              Code:
              awk 'BEGIN{i=0;while(getline a[i++]<"sorted.list">0){};i=0}a[i]==$1{i++;print;getline;print}' bigseqfile.fa
              but if you are having problems indexing the fasta file
              and they are resource related, you are likely going to
              continue to have problems till you address that.

              Comment

              • mgallo2
                Junior Member
                • Feb 2016
                • 2

                #8
                Hello-

                (I posted this in another forum, as well, but this is similar):
                I'm trying to 'extract' nucleotide sequences from my results after running a local blastn against my local database. The output format right now is a standard "blast-looking" result page, but it's not easy to work with the results when I want to further compare sequences. (I have a gene- and will have genes- of interest. I want to see how they compare to the local database of my sequences, but then I want the results in FASTA format for further analyses and comparisons). The help manual for local blast doesn't seem to have the answers I am looking for

                Will gladly provide additional information if more is needed. I really thank anyone who is able to help!

                Comment

                • tomc
                  Member
                  • Feb 2011
                  • 29

                  #9
                  Sequence of blast results

                  Lets try saying it this way.

                  Don't extract sequences from your blast results.

                  Extract a list of sequence identifiers (top hit perhaps) from your blast results.
                  Then use that list of identifiers to extract the fasta sequence from the blast _database_ your blast results are from.

                  That is:
                  There are command line tools that given an identifier (or list of identifiers) will return those fasta that went into making your "local blast database".

                  Comment

                  Latest Articles

                  Collapse

                  • SEQadmin2
                    Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                    by SEQadmin2


                    I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                    Here are nine questions we think about, in roughly the order they matter, before...
                    06-18-2026, 07:11 AM
                  • SEQadmin2
                    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                    by SEQadmin2


                    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                    ...
                    06-02-2026, 10:05 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Yesterday, 05:37 AM
                  0 responses
                  6 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-26-2026, 11:10 AM
                  0 responses
                  16 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-17-2026, 06:09 AM
                  0 responses
                  51 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-09-2026, 11:58 AM
                  0 responses
                  110 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...