SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extract sequence from multi fasta file with PERL andreitudor Bioinformatics 27 07-07-2019 07:45 AM
How to Extract Multiple Sequence from Multi Fasta File by ID list Anti Bioinformatics 4 02-02-2015 04:02 AM
extract subset of fastq based on length sequence?? dapizarro Ion Torrent 12 08-26-2014 01:01 PM
Extract partial sequence from FASTA record cdlam Bioinformatics 9 10-30-2012 02:21 PM
Extract only sequence ids from fasta file with makeblastdb angeloulivieri Bioinformatics 13 07-30-2012 02:41 AM

Reply
 
Thread Tools
Old 11-09-2015, 09:01 AM   #1
Gabriel_
Junior Member
 
Location: Brest

Join Date: Oct 2015
Posts: 2
Default Extract a fasta sequence based on Id AND length

Hi all !

After several days looking and testing, I'm finally here, looking for help...

I have a large multi fasta file from RNA-seq which contain isoforms.

The file looks like this : (truncated for the example)

Code:
>comp32_c0_seq1 len=365 path=[18710:0-364]
CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
>comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp36_c0_seq1 len=275 path=[22213:0-274]
CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
Here the sequence ">comp32_c0_seq1 len=365 path=[18710:0-364]" is unique while the sequence ">comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]" and ">comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]" are isoforms (note the "seq1" and "seq2").


Tipically what I want is a small script that is able to
1) check the #1 Id whit the #2, #3, #4...
2) If the #1 Id is unique, pass to the #2 Id
3) if the #1 Id is not unique, then compare its length (from the "len=XXX" commentary string in the header or from the sequence itself, it doesn't matters) whit the length of the other Ids, and finally keep the longest.

Note that some sequence have more than 10 isoforms...

Is it something feasible ?

I would really appreciate if you guys could give me a hand on this..!

Thanks !

Gabriel.
Gabriel_ is offline   Reply With Quote
Old 11-09-2015, 09:22 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That would require some fancy scripting, though it's feasible. It could miss some exons, though.

Alternatively, you could run Dedupe (int the BBMap package) like this:

dedupe.sh in=contigs.fasta out=deduped.fasta

That will absorb all duplicate or contained sequences. The net effect will be to retain the longest transcript per gene (so if some transcript contains all the exons, it will keep that one). However, if there is alternative splicing such that some transcript contains a unique exon not found in other transcripts, it will keep that one, too. For most uses, this is probably a safer method.
Brian Bushnell is offline   Reply With Quote
Old 11-09-2015, 11:38 PM   #3
Gabriel_
Junior Member
 
Location: Brest

Join Date: Oct 2015
Posts: 2
Default

Hi, thanks for your quick reply.
I tried it out but it does not seem to do exactly what I want it to... I think dedupe looks for exact duplicates which results in this output :

Code:
Input:                  	216203 reads 		178150781 bases.
Duplicates:             	4 reads (0.00%) 	3003 bases (0.00%)     	0 collisions.
Containments:           	29 reads (0.01%) 	15742 bases (0.01%)    	177580 collisions.
Result:                 	216170 reads (99.98%) 	178132036 bases (99.99%)
While I know that there is more than 4 "duplicates" (i.e. isoforms) in the whole dataset...
For now, I don't really care if some isoforms are completely removed from my dataset... I really need to keep only the longest sequence for each Id.

Thank you again,
I'll try to find another way
Gabriel_ is offline   Reply With Quote
Old 11-10-2015, 09:23 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Gabriel,

It actually removes both exact duplicates and full containments. There were only 29 isoforms that were fully contained by other isoforms; there is no way to get a smaller subset of the data without losing unique sequence.

That said, the results are pretty surprising; looks like that method is not very effective in this case.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 11-10-2015, 12:20 PM   #5
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Use the following script at your own risk.
There could be bugs left, but they should be easy to fix.
I'll probably put the script on my GitHub account.

You do need Python3 and BioPython installed to be able to run it, which may prove to be an obstacle to a non-programmer.

I'll go back to writing code for which I actually get paid now.

collapseIsoForms.py

Code:
#!/usr/bin/env python3

import argparse

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Read the command line arguments.
parser = argparse.ArgumentParser(description="Collapses isoforms, keeping only the longest one.")
parser.add_argument("-i", "--input_file", help="Input FASTA file.", required=True)
parser.add_argument("-o", "--output_file", help="Output FASTA file with collapsed isoforms", required=True)
args = parser.parse_args()

# Process the command line arguments.
input_file = args.input_file
output_file = args.output_file

# Get FASTA file handle
fasta_sequences = SeqIO.parse(open(input_file),'fasta')

# Get output file handle
output_handle = open(output_file, "w")

# Create a variable to store the longest record
# Set it to the first record, to start
longest_seq_record = next(fasta_sequences)

#Process FASTA file
with open(output_file) as out_file:
    for seq_record in fasta_sequences:
        # Compare id of current seq_record to id of longest stored seq_record
        if (seq_record.id == longest_seq_record.id):
            # Compare lengths
            if(len(seq_record) > len(longest_seq_record)):
                # Store current record as the longest record to date.
                longest_seq_record.id = seq_record
        else:
            # New id. Print previous longest_seq_record to date.
            output_handle.write(longest_seq_record.format("fasta"), end="")
            # Reset longest_seq record.
            longest_seq_record = seq_record
Code:
[blancha@lg-1r17-n02 ~]$ collapseIsoforms.py -i=test.fa -o=test_collapsed.fa
[blancha@lg-1r17-n02 ~]$ more test.fa
>comp32_c0_seq1 len=365 path=[18710:0-364]
CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
>comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp36_c0_seq1 len=275 path=[22213:0-274]
CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
[blancha@lg-1r17-n02 ~]$ more test_collapsed.fa 
>comp32_c0_seq1 len=365 path=[18710:0-364]
CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
>comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG

Last edited by blancha; 11-10-2015 at 12:44 PM.
blancha is offline   Reply With Quote
Reply

Tags
different length, isoforms, multi fasta, same headers

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO