  • Subsetting a fasta file based on a set of BLAST results

    Hi,

    I have assembled a de novo transcriptome from my species of interest using Trinity. The species has an internal symbiont (algae), whose transcripts were inevitably sequenced along with the host's. I ran BLAST on my assembly, which identified a number of contigs corresponding to the symbiont. I would like to use the BLAST results to move those contigs out of the assembly into a new file so I can deal with the two organisms separately. Is there an easy way to do this?

    I have cleaned up the Trinity FASTA output so it has a single unique ID for each contig, like so:
    >TR1-c0_g1_i1
    AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC (etc...)

    I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb) and then made another file with all the IDs of the symb blast hits removed from the full id file, thus representing the "host" sequences (id-host), using sort and uniq.
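The ID bookkeeping described above (cut, then sort and uniq) can be sketched in Python as follows. This is only an illustration under assumptions the post does not state: that the BLAST results are tab-separated with the query contig ID in the first column, and that lines starting with `#` are comments. The function names are hypothetical.

```python
def query_ids_from_blast_tabular(lines):
    """Collect the unique query IDs from tab-separated BLAST result lines.

    Assumption: the query (contig) ID is the first tab-separated field.
    """
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[0])
    return ids


def host_ids(all_ids, symb_ids):
    """Return the IDs from all_ids that are not symbiont hits, in input order."""
    symb = set(symb_ids)  # set lookup avoids rescanning a list for every ID
    return [contig_id for contig_id in all_ids if contig_id not in symb]
```

With files on disk, the lines of the BLAST output and of the full ID list would be read with `open(...)` and passed to these functions; the result of `host_ids` corresponds to the id-host file.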

    I therefore have the following.
    combined.fasta (Trinity output of all the contigs)
    id-symb (single-column text file of the IDs extracted from a BLAST search against the symbiont transcriptome)
    id-host (single-column text file of all the IDs from the combined.fasta file minus the id-symb IDs)

    I would like to generate the following:
    symb.fasta = all those sequences in the combined.fasta from the id-symb list
    and
    host.fasta = the rest (i.e. combined.fasta - symb.fasta) aka all those sequences in the combined.fasta from the id-host list

    I've been trying to use a looped fasgrep (from the FAST Perl module), but that is far too slow (it has taken more than a day to get through less than 10% of the file), so I'm sure there must be a better way.

    The assembly contains ~250,000 contigs.
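If the loop runs one search per ID, each search has to rescan the whole FASTA file, so the work grows with (number of IDs) x (number of contigs). A single pass that looks each header up in a set avoids that. The following is a minimal Python sketch of the idea, not the method used by any tool mentioned in this thread; it assumes the record ID is the header text after `>` up to the first whitespace, and the function names are hypothetical.

```python
def tag_fasta_lines(lines, ids):
    """Yield (in_list, line) for each FASTA line, reading the input once.

    in_list is True when the line belongs to a record whose ID is in ids.
    Lines that appear before the first header are skipped.
    """
    wanted = set(ids)  # average O(1) membership test per header
    in_list = None
    for line in lines:
        if line.startswith(">"):
            # Assumed ID convention: text after '>' up to the first whitespace
            fields = line[1:].split()
            in_list = bool(fields) and fields[0] in wanted
        if in_list is not None:
            yield in_list, line


def split_fasta_file(fasta_path, ids_path, symb_path, host_path):
    """Write listed records to symb_path and all other records to host_path."""
    with open(ids_path) as id_file:
        ids = {line.strip() for line in id_file if line.strip()}
    with open(fasta_path) as fasta, \
            open(symb_path, "w") as symb_out, \
            open(host_path, "w") as host_out:
        for in_list, line in tag_fasta_lines(fasta, ids):
            (symb_out if in_list else host_out).write(line)
```

With the file names from the post, the call would be `split_fasta_file("combined.fasta", "id-symb", "symb.fasta", "host.fasta")`. Only the ID set is held in memory; the sequences are streamed line by line.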


    Thanks.

  • #2
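    Using R, this is quite easy and fast:

    library(Biostrings)

    ## Give the path to your combined.fasta (readDNAStringSet replaces the older read.DNAStringSet)
    all_fasta <- readDNAStringSet("combined.fasta")

    id_symb <- scan("id-symb", what="character", sep="\n")

    symbFasta <- all_fasta[names(all_fasta) %in% id_symb]
    hostFasta <- all_fasta[!(names(all_fasta) %in% id_symb)]

    ## Write the two subsets out as FASTA files
    writeXStringSet(symbFasta, "symb.fasta")
    writeXStringSet(hostFasta, "host.fasta")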

    • #3
      Another option is Jim Kent's faSomeRecords utility. I am linking the Linux version, but he has source and OS X executables available as well.



      faSomeRecords - Extract multiple fa records
      usage:
      faSomeRecords in.fa listFile out.fa
      options:
      -exclude - output sequences not in the list file.

      • #4
        faSomeRecords = lifesaver

        Thank you so much genomax - that took less than 2 seconds for each file. The other way was still chugging away after two days...
