Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract a fasta sequence based on Id AND length

    Hi all !

    After several days looking and testing, I'm finally here, looking for help...

    I have a large multi fasta file from RNA-seq which contain isoforms.

    The file looks like this : (truncated for the example)

    Code:
    >comp32_c0_seq1 len=365 path=[18710:0-364]
    CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
    >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
    ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
    AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
    >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
    ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
    AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
    >comp36_c0_seq1 len=275 path=[22213:0-274]
    CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
    Here the sequence ">comp32_c0_seq1 len=365 path=[18710:0-364]" is unique while the sequence ">comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]" and ">comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]" are isoforms (note the "seq1" and "seq2").


    Tipically what I want is a small script that is able to
    1) check the #1 Id whit the #2, #3, #4...
    2) If the #1 Id is unique, pass to the #2 Id
    3) if the #1 Id is not unique, then compare its length (from the "len=XXX" commentary string in the header or from the sequence itself, it doesn't matters) whit the length of the other Ids, and finally keep the longest.

    Note that some sequence have more than 10 isoforms...

    Is it something feasible ?

    I would really appreciate if you guys could give me a hand on this..!

    Thanks !

    Gabriel.

  • #2
    That would require some fancy scripting, though it's feasible. It could miss some exons, though.

    Alternatively, you could run Dedupe (int the BBMap package) like this:

    dedupe.sh in=contigs.fasta out=deduped.fasta

    That will absorb all duplicate or contained sequences. The net effect will be to retain the longest transcript per gene (so if some transcript contains all the exons, it will keep that one). However, if there is alternative splicing such that some transcript contains a unique exon not found in other transcripts, it will keep that one, too. For most uses, this is probably a safer method.

    Comment


    • #3
      Hi, thanks for your quick reply.
      I tried it out but it does not seem to do exactly what I want it to... I think dedupe looks for exact duplicates which results in this output :

      Code:
      Input:                  	216203 reads 		178150781 bases.
      Duplicates:             	4 reads (0.00%) 	3003 bases (0.00%)     	0 collisions.
      Containments:           	29 reads (0.01%) 	15742 bases (0.01%)    	177580 collisions.
      Result:                 	216170 reads (99.98%) 	178132036 bases (99.99%)
      While I know that there is more than 4 "duplicates" (i.e. isoforms) in the whole dataset...
      For now, I don't really care if some isoforms are completely removed from my dataset... I really need to keep only the longest sequence for each Id.

      Thank you again,
      I'll try to find another way

      Comment


      • #4
        Hi Gabriel,

        It actually removes both exact duplicates and full containments. There were only 29 isoforms that were fully contained by other isoforms; there is no way to get a smaller subset of the data without losing unique sequence.

        That said, the results are pretty surprising; looks like that method is not very effective in this case.

        -Brian

        Comment


        • #5
          Use the following script at your own risk.
          There could be bugs left, but they should be easy to fix.
          I'll probably put the script on my GitHub account.

          You do need Python3 and BioPython installed to be able to run it, which may prove to be an obstacle to a non-programmer.

          I'll go back to writing code for which I actually get paid now.

          collapseIsoForms.py

          Code:
          #!/usr/bin/env python3
          
          import argparse
          
          from Bio import SeqIO
          from Bio.Seq import Seq
          from Bio.SeqRecord import SeqRecord
          
          # Read the command line arguments.
          parser = argparse.ArgumentParser(description="Collapses isoforms, keeping only the longest one.")
          parser.add_argument("-i", "--input_file", help="Input FASTA file.", required=True)
          parser.add_argument("-o", "--output_file", help="Output FASTA file with collapsed isoforms", required=True)
          args = parser.parse_args()
          
          # Process the command line arguments.
          input_file = args.input_file
          output_file = args.output_file
          
          # Get FASTA file handle
          fasta_sequences = SeqIO.parse(open(input_file),'fasta')
          
          # Get output file handle
          output_handle = open(output_file, "w")
          
          # Create a variable to store the longest record
          # Set it to the first record, to start
          longest_seq_record = next(fasta_sequences)
          
          #Process FASTA file
          with open(output_file) as out_file:
              for seq_record in fasta_sequences:
                  # Compare id of current seq_record to id of longest stored seq_record
                  if (seq_record.id == longest_seq_record.id):
                      # Compare lengths
                      if(len(seq_record) > len(longest_seq_record)):
                          # Store current record as the longest record to date.
                          longest_seq_record.id = seq_record
                  else:
                      # New id. Print previous longest_seq_record to date.
                      output_handle.write(longest_seq_record.format("fasta"), end="")
                      # Reset longest_seq record.
                      longest_seq_record = seq_record
          Code:
          [blancha@lg-1r17-n02 ~]$ collapseIsoforms.py -i=test.fa -o=test_collapsed.fa
          [blancha@lg-1r17-n02 ~]$ more test.fa
          >comp32_c0_seq1 len=365 path=[18710:0-364]
          CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
          >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp36_c0_seq1 len=275 path=[22213:0-274]
          CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
          [blancha@lg-1r17-n02 ~]$ more test_collapsed.fa 
          >comp32_c0_seq1 len=365 path=[18710:0-364]
          CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
          >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          Last edited by blancha; 11-10-2015, 01:44 PM.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          18 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          22 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          17 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          48 views
          0 likes
          Last Post seqadmin  
          Working...
          X