SEQanswers

Old 12-16-2010, 01:52 PM   #1
Wiseone
Junior Member
 
Location: Canada

Join Date: Apr 2010
Posts: 7
File Conversion / Usage with Windows and Linux

Hi All,
For the past few months I have been using CLC Workbench on Windows 7 to do de novo assembly of transcriptome sequences. Now that I am happy with my contigs, I want to move on to other tasks that are primarily done in Linux. I have had my computer reformatted as a dual-boot system to run Ubuntu, but I have run into the problem that the FASTA file generated in CLC (Windows) will not work in Linux.
For example, I am unable to BLAST my FASTA contig file against other databases. If I run formatdb I get an error message saying the file can't be opened. I am guessing this has to do with the differences between DOS and Unix file formatting. I tried the dos2unix command in Linux but I still cannot use the file. Does anyone have a solution whereby I can make my FASTA file from Windows usable for a BLAST in Linux? As a last resort I can switch CLC to Linux, but this would require me to change from Ubuntu to Red Hat, and I have just finished installing quite a bit of software in Ubuntu. I should say that I am completely new to Linux.
Thanks!
Old 12-17-2010, 02:08 AM   #2
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78

Hi,

I noticed this post because someone in my office had a similar problem (miRNAkey on biolinux, basically Ubuntu, refused to open fasta files created by CLC on Windows).

So the problem might be with the end-of-line character used, with the text encoding, or with the way CLC writes a FASTA file and the way formatdb reads it. In theory, dos2unix should take care of the first point. You could also try opening the file in gedit (or another Linux text editor). I know that TextWrangler on my OS X machine has an option to show 'hidden characters'; I don't know about gedit. You could try opening the file in gedit and then saving it explicitly in the correct encoding (UTF-8, I think).
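
One way to check the first two points without an editor is to count the line endings directly. The following is only a minimal sketch: the helper name `inspect_text` is made up for illustration, and it reads the whole file into memory, which assumes the contig file is not too large.

```perl
use strict;
use warnings;

# Hypothetical helper: report which line endings a chunk of text uses
# and whether it starts with a UTF-8 byte-order mark (BOM).
sub inspect_text {
    my ($text) = @_;
    $text = '' unless defined $text;
    my $bom  = ($text =~ /^\xEF\xBB\xBF/) ? 1 : 0;
    my $crlf = () = $text =~ /\r\n/g;        # DOS/Windows line endings
    my $lf   = () = $text =~ /(?<!\r)\n/g;   # Unix line endings
    my $cr   = () = $text =~ /\r(?!\n)/g;    # bare CR (old Mac style)
    return { bom => $bom, crlf => $crlf, lf => $lf, cr => $cr };
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $file = $ARGV[0];
    open(my $in, '<:raw', $file) or die "Can't open $file: $!\n";
    local $/;                 # slurp mode: read the whole file at once
    my $text = <$in>;
    close $in;
    my $report = inspect_text($text);
    printf "BOM: %s, CRLF: %d, LF: %d, bare CR: %d\n",
        ($report->{bom} ? 'yes' : 'no'),
        $report->{crlf}, $report->{lf}, $report->{cr};
}
```

A non-zero CRLF count would point at Windows line endings; a BOM at the start of the file would point at an encoding issue instead.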

In the end, the problem with miRNAkey had nothing to do with newlines or encoding. It expected each sequence to be on a single line (which makes sense, they're miRNAs). CLC, on the other hand, used a more 'standard' way of writing FASTA files: it adds a newline after a fixed number of bases (75? 60? dunno). However, I doubt that formatdb has a problem with this. Check the docs to be safe.

Another problem this person faced was the way the files were copied. He used VMware instead of a dual boot, basically running Linux inside Windows. When he copied the files directly from Windows to Linux, he couldn't open them. When he copied them to a USB drive, mounted the drive in Linux and then copied them from there, he was fine.

Hope that helps,
cheers
Old 12-17-2010, 03:15 AM   #3
mfursov
Junior Member
 
Location: Russia

Join Date: Dec 2009
Posts: 6

Quote:
Originally Posted by Wiseone View Post
Does anyone have a solution whereby I can make my FASTA file from Windows usable for a BLAST in Linux?
Thanks!
Could you share your FASTA file with us? The problem looks interesting to me (I'm one of the UGENE developers), and I think that by investigating it I will be able both to test the tool and to help you solve your problem.
Old 12-20-2010, 05:27 AM   #4
RDW
Member
 
Location: London

Join Date: Oct 2008
Posts: 63

Just to rule out any issues that aren't OS-specific, is the file compatible with the Windows version of formatdb?

ftp://ftp.ncbi.nlm.nih.gov/blast/exe...elease/LATEST/
Old 12-23-2010, 12:47 PM   #5
Wiseone
Junior Member
 
Location: Canada

Join Date: Apr 2010
Posts: 7

So, everything is now working. Bruins was correct. By opening the FASTA file in gedit and saving it with Linux line endings, I was able to use the file.
Old 12-23-2010, 01:24 PM   #6
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

I see you fixed your problem. However, I don't think you want to open a huge file in an editor if you have a lot of contigs. I also had this problem before and used a Perl script to solve it. I converted my file using the s/\r\n/\n/ substitution. Here is a Perl script to convert your contig file (it also joins each wrapped sequence onto a single line):

Code:
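use strict;
use warnings;

# Convert DOS line endings to Unix and join each wrapped sequence onto one line.
# Usage: perl thisscript.pl contigfile > newfile
my $contigfile = $ARGV[0];

open(my $in, '<', $contigfile) or die "Can't open $contigfile -- fatal\n";
my ($seq, $prevhead) = ('', '');
while (my $line = <$in>) {
  $line =~ s/\r?\n$//;               # strip a trailing CRLF or LF
  if ($line =~ /^>/) {
    # New header: print the previous record, if there is one
    print "$prevhead\n$seq\n" if $seq ne '';
    $prevhead = $line;
    $seq = '';
  } else {
    $seq .= $line;                   # append this piece of the sequence
  }
}
# Print the last record in the file
print "$prevhead\n$seq\n" if $seq ne '';
close $in;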
Boetsie.
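
If only the line endings need to change and the wrapped sequence lines should stay as they are, a smaller line-by-line conversion is enough. This is a hedged sketch rather than a replacement for dos2unix: the helper name `to_unix_line` and the two command-line arguments (input file, output file) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return one line with any trailing CRLF, LF or CR
# replaced by a single Unix newline.
sub to_unix_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;   # remove LF or CRLF at the end of the line
    $line =~ s/\r\z//;      # remove a stray CR on a last line without LF
    return "$line\n";
}

# Only runs when called as: perl script.pl input.fa output.fa
if (@ARGV == 2) {
    my ($infile, $outfile) = @ARGV;
    open(my $in,  '<:raw', $infile)  or die "Can't open $infile: $!\n";
    open(my $out, '>:raw', $outfile) or die "Can't write $outfile: $!\n";
    while (my $line = <$in>) {
        print {$out} to_unix_line($line);   # one line in memory at a time
    }
    close $in;
    close $out or die "Can't close $outfile: $!\n";
}
```

Because it reads one line at a time, memory use stays small even for a large contig file.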
Old 01-02-2011, 03:23 AM   #7
skycreative
Member
 
Location: GuangXi China

Join Date: Jan 2010
Posts: 27
Default

Quote:
Originally Posted by boetsie View Post
I see you fixed your problem, however, I don't think you want to open a huge file if you have a lot of contigs. I also had this problem befor and used a perl script to solve it. I converted my file using the s/\r\n/\n/ function. Here is a perl script to convert your contig file;

Code:
my contigfile = $ARGV[0];

open(IN,contigfile) || die "Can't open contigfile -- fatal\n";
my ($seq, $prevhead) = ('','');
while(<IN>){
  s/\r\n/\n/;
  chomp;
  $seq.= $_ if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
    my $head=$_;
    if($seq ne ""){
      print "$prevhead\n$seq\n";
    }
    $prevhead = $head;
    $seq = '';
  }else{
    $seq .= $_;
  }
}
close IN;
Boetsie.
It is simple, and adding an output step that wraps the sequence at 50 bases per line would make it more useful:
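# Re-wrap $seq at 50 bases per line before printing.
# In boetsie's script the header belonging to $seq is held in $prevhead.
my $txt = '';
for (my $i = 0; $i * 50 < length($seq); $i++) {
  $txt .= substr($seq, $i * 50, 50) . "\n";
}
print $prevhead, "\n";
print $txt;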
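
The same wrapping step can be written as a self-contained function, which makes it easier to try on its own. This is just a sketch: the name `wrap_seq`, the default width of 50 and the example header are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence string into lines of at most
# $width characters, each terminated by a Unix newline.
sub wrap_seq {
    my ($seq, $width) = @_;
    $width ||= 50;                       # default: 50 bases per line
    my $txt = '';
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        $txt .= substr($seq, $pos, $width) . "\n";
    }
    return $txt;
}

# Example: a made-up 120-base contig is printed as lines of 50, 50 and 20 bases.
print ">example_contig\n", wrap_seq('ACGT' x 30, 50);
```

In boetsie's script, the call would replace the plain `$seq` in the print statement, for example `print "$prevhead\n", wrap_seq($seq, 50);`.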