  • sort contigs based on fasta header

    Hello All,

    I have a problem that I am looking for some input on.

    I have been using the genome assembler SPAdes, which outputs assembled contigs as a fasta file. I would like to sort my contigs based on average coverage (i.e. remove contigs with low coverage), which sounds like it should be a fairly easy task. However, all of the coverage information is contained within the fasta sequence header itself, for example (taken from my assembly file):

    Code:
    >NODE_100_length_628_cov_0.818363_ID_199
    So the average coverage for this particular contig is about 0.82.
    The fact that the information is contained within the header itself and is not in a table format prevents me from doing some sort of copy/paste sorting shenanigans in Excel. So I figured I could write something up in Perl and use a regular expression to sort based on the value of the numbers following cov_. But I ran into some issues with that, likely because I am a beginner and I still don't really know what I'm doing. I know I need to use BioPerl for the sequence/multifasta handling, and I know I need to restrict the matching to the header only and not the sequence itself, and then I need a way to delete all sequences with headers that do not meet a certain value (e.g. all values less than 10).

    I've done some research via the almighty Google and come across people trying to complete similar tasks, but in all of the cases I found the individuals knew the EXACT header/sequence name of the sequence they wanted to extract. These methods are not very applicable to me since I am looking to sort sequences based on whether or not they meet a specific condition.

    Any input or advice to lead me in the right direction would be greatly appreciated. So far all my code does is read the input file *golf clap*
    I also know that because my file is relatively large (4MB) it is not efficient to have a script that reads everything line by line, but I'm not sure what else to do or how to address that issue.

    Please help! And thanks in advance,

    ~Ana

  • #2
    4MB is not a large file. Reading line by line would be fine. However, your code was not included in your post, so it is hard to say what is wrong with it.
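To show what reading line by line can look like, here is a minimal sketch in plain Perl (no BioPerl, so it runs on a stock install). The helper name filter_by_coverage, the cutoff of 10, and the demo sequences are made up for illustration; only the second header comes from the question. It assumes every header carries a cov_ field followed by a number, as in SPAdes output.

```perl
use strict;
use warnings;

# Hypothetical helper: copy FASTA records from $in to $out, keeping only
# those whose header carries a cov_ value of at least $min_cov.
sub filter_by_coverage {
    my ($in, $out, $min_cov) = @_;
    my $keep = 0;    # true while we are inside a record that passed
    my $kept = 0;    # number of records written
    while (my $line = <$in>) {
        if ($line =~ /^>/) {
            # Header line: decide once per record, never look at sequence lines.
            my ($cov) = $line =~ /_cov_([0-9.]+)/;
            $keep = (defined $cov && $cov >= $min_cov) ? 1 : 0;
            $kept++ if $keep;
        }
        print {$out} $line if $keep;
    }
    return $kept;
}

# Demo on an in-memory file; the sequences are placeholders.
my $fasta = <<'END';
>NODE_1_length_8_cov_25.4_ID_1
ACGTACGT
>NODE_100_length_628_cov_0.818363_ID_199
TTTT
GGGG
END

open my $in, '<', \$fasta or die "cannot open in-memory input: $!";
my $result = '';
open my $out, '>', \$result or die "cannot open in-memory output: $!";
my $kept = filter_by_coverage($in, $out, 10);
close $out;
print $result;    # only the NODE_1 record survives the cutoff
```

For a real file you would open the contig FASTA for reading and a new file for writing, then pass those two handles in. Memory use stays flat because only one line is held at a time.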



    • #3
      Sorry, it wasn't included because as of right now it doesn't do anything other than read the file, so I didn't think posting it would be helpful. I don't know how to go about solving the other problems I stated in my question, which is why I asked for input. But since you asked, here it is:

      Code:
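      use strict;
      use warnings;
      use Bio::SeqIO;

      # Open the SPAdes contig file as a FASTA stream.
      my $seqio_obj = Bio::SeqIO->new(-file => "/Users/annaliesejones/Desktop/Assembly6contigs.fasta", -format => "fasta");

      # Print the bare sequence of every record (no header, no newline).
      while (my $seq_obj = $seqio_obj->next_seq) {
          print $seq_obj->seq;
      }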
      Even that doesn't work well: it prints all sequences together without a newline or the fasta header. But that's not really the point, I just wanted to see if I could read the file.

      And now I'm stuck.



      • #4
        1) Use a newline after printing ... e.g., 'print $seq_obj->seq . "\n" ' ... many ways to do this.

        2) Look at $seq_obj->display_id for your header information. From there you can use a regexp to pull out the information; e.g.,
        Code:
        (my $coverage) = $seq_obj->display_id =~ m/_cov_(.+)_ID/;
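Once the coverage can be pulled out of a name, filtering and sorting are a grep and a sort away. The sketch below works on plain strings so it runs without BioPerl; in a BioPerl loop the same helper would be fed whatever string the sequence object returns for the header (display_id, as far as I know). The helper name coverage_of, the cutoff, and the second and third names are made up for illustration, and the pattern assumes the cov_ field is a plain decimal number.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the cov_ value out of a SPAdes-style contig name.
# Returns undef when the name has no such field.
sub coverage_of {
    my ($id) = @_;
    my ($cov) = $id =~ /_cov_([0-9.]+)/;
    return $cov;
}

my @ids = (
    'NODE_100_length_628_cov_0.818363_ID_199',    # header from the question
    'NODE_1_length_5000_cov_25.4_ID_1',           # made-up
    'NODE_7_length_900_cov_12.0_ID_13',           # made-up
);

# Keep names at or above the cutoff, then order them by descending coverage.
my $min_cov = 10;
my @sorted = sort { coverage_of($b) <=> coverage_of($a) }
             grep { defined coverage_of($_) && coverage_of($_) >= $min_cov } @ids;

print "$_\t", coverage_of($_), "\n" for @sorted;
```

Matching only digits and dots after cov_ means the capture no longer depends on the _ID suffix being present, which may matter if a different assembler version formats the name differently.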

