SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
just perl script semna Bioinformatics 3 07-02-2011 08:42 AM
vcftools perl script weiyulin Bioinformatics 6 12-09-2010 02:13 PM
perl script bioenvisage Bioinformatics 5 02-01-2010 08:11 AM
perl script bioenvisage Bioinformatics 0 02-01-2010 07:23 AM
Perl script bioenvisage Bioinformatics 4 01-28-2010 12:25 PM

Reply
 
Thread Tools
Old 07-13-2010, 12:09 AM   #1
Adamo
Member
 
Location: Paris

Join Date: Jun 2010
Posts: 28
Default [Optimization] perl script for unaligned reads

Hello,

I have intended to write a script to extract unaligned reads from a fasta file.

The script takes the first line of the file which contains all the reads, and run throught the entire one (the one containing all aligned reads), until it finds a match between two reads IDs. If it doesn't, then it means the read is unaligned and it is stored into the file unaligned.fasta (aligned.fasta in the other case).

The problem is the script runs very, very slowly, and I don't know how to optimize it.
Any help would be very appreciated.

The code (I made no comments of it, do not hesitate to tell me if you want me to do so):

Code:
#!/usr/bin/perl

use Bio::SeqIO;
use Bio::Root::Exception;


my $usage = "unalignedreads.pl alignedreads totalreads\n";
my $alignedreads = shift or die $usage;
my $totalreads = shift or die $usage;
$inseq1 = Bio::SeqIO->new('-file' => "<$totalreads",
			     '-format' => 'fasta');
$inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
			     '-format' => 'fasta');
my %outfiles = (
		'aligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/aligned.fasta',
					     '-format'=> 'fasta'),
		'unaligned' => Bio::SeqIO->new('-file'=> '>/home/biopuce11/Documents/reads/unaligned.fasta',
					       '-format' => 'fasta')
		);
while ($seqin1 = $inseq1->next_seq) {

	$flag = 0;
	while (($seqin2 = $inseq2->next_seq) && $flag == 0) {

		$sequ2 = $seqin2->display_id;
		$sequ1 = $seqin1->display_id;
		chomp($sequ1);
		chomp($sequ2);
		if ( "$sequ1" eq "$sequ2"){
			$flag = 1;
		} 			
	}
	$inseq2 = Bio::SeqIO->new('-file' => "<$alignedreads",
			     '-format' => 'fasta');

	if ( $flag == 1) {
		$outfiles{'aligned'}->write_seq($seqin1);
	} 
	else {
		$outfiles{'unaligned'}->write_seq($seqin1);
	}
}
exit;
Thanks in advance.
Adamo is offline   Reply With Quote
Old 07-13-2010, 01:09 AM   #2
Dethecor
Member
 
Location: Germany

Join Date: May 2010
Posts: 24
Default Set intersection

Now, I'm not an expert in (Bio-)Perl, but what you are doing seems to be somewhere in the time complexity of O( |all reads| * |aligned reads| ) with a lot of file connections being opened.

Depending on how many reads there are, you might be better of just reading both sets of read-ids ('all' and 'aligned') into an appropriate data-structure and do a set intersection.
Then use theses lists of id's to parse your .fasta-files and split them into 'unaligned' and 'aligned' (is there a Set with operations 'union', 'intersection', etc. defined in Perl?)

Cheers
__________________

"You are only young once, but you can stay immature indefinitely."
Dethecor is offline   Reply With Quote
Old 07-13-2010, 04:13 AM   #3
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

First, you read in both 'aligned reads' and 'unaligned reads'. This takes some time.
Within the while loop, you read in the unaligned reads (variable $inseq2) again. This means that for every sequence in 'aligned reads' the program reads in the unaligned reads. Find a way around this and save a LOT of time. Perhaps rename the first $seqin2 in
Code:
while (($seqin2 = $inseq2->next_seq) && $flag == 0) {
Chrz
Bruins is offline   Reply With Quote
Old 07-13-2010, 06:21 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,165
Default

Dethecor is correct, you are doing this in the most complex (time wise) manner possible. He is also correct that this is formally a set operation, and the operation you want to perform is a 'difference'. We can also make a simplifying assumption, that being that Aligned reads is a proper subset of Total reads.

Here is how I would do it:

- Read in the aligned.fasta, storing the display_ids in a hash.

- Read through the total.fasta, check if the current display_id is defined in your hash.

- If it is not defined, write the current seq object to the unaligned.fasta file

(I don't see the need to write out an aligned.fasta file since that file already exists; it was one of your input files after all.)

This method opens and reads each of the files (aligned.fasta, total.fasta) only once. A single comparison is done for each member of total.fasta.

Last edited by kmcarr; 07-13-2010 at 06:26 AM.
kmcarr is offline   Reply With Quote
Old 07-15-2010, 03:10 AM   #5
Adamo
Member
 
Location: Paris

Join Date: Jun 2010
Posts: 28
Default

Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
And you are right, aligned.fasta isn't useful at all, don't know why I did that.
It is very much faster this way!

@Bruins: Thank you too for your suggestion, it's fine now.
Adamo is offline   Reply With Quote
Old 07-15-2010, 05:20 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,165
Default

Quote:
Originally Posted by Adamo View Post
Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
Adamo,

Are you using an array or a hash to store the ids of the aligned reads? If you are using an array you then for each id in the total reads file you would need to scan through the array looking for the id of interest; this is inefficient. You should store the aligned read ids as keys in a hash (the values associated with these keys are irrelevant, setting them equal to 1 will be fine). Then for each id in your total reads file you only have to perform a single boolean operation, asking is this id defined as key in my hash?
kmcarr is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:30 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO