  • File Conversion / Usage with Windows and linux

    Hi All,
    For the past few months I have been using CLC Workbench on Windows 7 to do de novo assembly of transcriptome sequences. Now that I am happy with my contigs, I want to move on to other tasks that are primarily done in Linux. I have had my computer reformatted as a dual-boot system to run Ubuntu, but I have run into the problem that the FASTA file generated in CLC (Windows) will not work in Linux.
    For example, I am unable to BLAST my FASTA contig file against other databases. If I run formatdb I get an error message saying the file can't be opened. I am guessing this has to do with the differences between DOS and Unix file formatting. I tried the dos2unix command in Linux but I still cannot use the file. Does anyone have a solution whereby I can make my FASTA file from Windows usable for a BLAST in Linux? As a last resort I can switch CLC to Linux, but this would require me to change from Ubuntu to Red Hat, and I have just finished installing quite a bit of software in Ubuntu. I should say that I am completely new to Linux.
    Thanks!

  • #2
    Hi,

    I noticed this post because someone in my office had a similar problem (miRNAkey on biolinux, basically Ubuntu, refused to open fasta files created by CLC on Windows).

    So the problem might be with the end-of-line character used, with the text encoding, or with a mismatch between the way CLC writes a FASTA file and the way formatdb reads it. In theory, dos2unix should take care of the first point. You could also try to open the file in gedit (or another Linux text editor). I know that TextWrangler on my OS X machine has an option to show 'hidden characters'; I don't know whether gedit does. You could try opening the file in gedit and then saving it explicitly in the correct encoding (UTF-8 I think).
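If the editor at hand cannot show hidden characters, they can also be counted from a small script. The following is only a minimal sketch of that check, not something taken from CLC or formatdb: the helper names `inspect_text` and `inspect_file` are made up for illustration, and only the first 64 KB of the file is examined on the assumption that the line endings are the same throughout.

```perl
use strict;
use warnings;

# Hypothetical helper: summarize the line endings and any leading
# byte-order mark (BOM) found in a string of raw bytes.
sub inspect_text {
    my ($data) = @_;
    my %report;
    $report{bom} = $data =~ /^\xEF\xBB\xBF/          ? 'UTF-8'
                 : $data =~ /^(?:\xFF\xFE|\xFE\xFF)/ ? 'UTF-16'
                 :                                     '';
    $report{crlf} = () = $data =~ /\r\n/g;       # DOS/Windows endings
    $report{cr}   = () = $data =~ /\r(?!\n)/g;   # bare CR (classic Mac OS)
    $report{lf}   = () = $data =~ /(?<!\r)\n/g;  # Unix endings
    return \%report;
}

# Hypothetical helper: look at the first 64 KB only, so that a large
# contig file is not read into memory in full.
sub inspect_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!\n";
    my $data = '';
    defined(read($fh, $data, 65536)) or die "Can't read $path: $!\n";
    close $fh;
    return inspect_text($data);
}

# Run as: perl <this script> <fasta file>
if (@ARGV) {
    my $r = inspect_file($ARGV[0]);
    print "BOM: ", ($r->{bom} || 'none'), "\n";
    print "CRLF: $r->{crlf}  bare CR: $r->{cr}  LF: $r->{lf}\n";
}
```

A non-zero CRLF count points at DOS line endings, a non-zero bare CR count at a file that dos2unix in its default mode would not change, and a BOM at an encoding issue rather than a line-ending one.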

    In the end, the problem with miRNAkey had nothing to do with newlines or encoding. It expected each sequence to be on a single line (which makes sense, they're miRNAs). CLC, on the other hand, used a more 'standard' way of writing FASTA files: it adds a newline after so many bases (75? 60? dunno). However, I doubt that formatdb has a problem with this. Check the docs to be safe.
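To see which of the two layouts a given file uses, the records can be counted along with how many of them spread their sequence over more than one line. This is a rough sketch under the assumption of plain FASTA input (header lines starting with `>`); the name `fasta_layout` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA from a filehandle and return
# (number of records, records whose sequence spans several lines,
#  length of the longest sequence line).
sub fasta_layout {
    my ($fh) = @_;
    my ($records, $multiline, $maxlen, $lines_in_record) = (0, 0, 0, 0);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                 # tolerate DOS or Unix endings
        if ($line =~ /^>/) {
            $records++;
            $lines_in_record = 0;
        } elsif (length $line) {
            $lines_in_record++;
            $multiline++ if $lines_in_record == 2;   # count each record once
            $maxlen = length($line) if length($line) > $maxlen;
        }
    }
    return ($records, $multiline, $maxlen);
}

# Run as: perl <this script> <fasta file>
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
    my ($n, $multi, $max) = fasta_layout($fh);
    close $fh;
    print "$n records, $multi wrapped over several lines, ",
          "longest sequence line $max characters\n";
}
```

If the second number is zero, every sequence is already on one line; otherwise the longest line length shows the wrap width the exporting program used.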

    Another problem this person faced was the way the files were copied. He used VMware instead of a dual boot, basically running Linux inside Windows. When he copied the files directly from Windows to Linux, he couldn't open them. When he copied them to a USB stick, remounted the stick in Linux and then copied them from there, he was fine.

    Hope that helps,
    cheers


    • #3
      Originally posted by Wiseone
      Does anyone have a solution whereby I can make my FASTA file from Windows usable for a BLAST in Linux?
      Thanks!
      Could you share your FASTA file with us? The problem looks interesting to me (I'm one of the UGENE developers), and I think that by investigating it I will be able both to test the tool and to help you solve your problem.
      ---
      http://ugene.unipro.ru


      • #4
        Just to rule out any issues that aren't OS-specific, is the file compatible with the Windows version of formatdb?

        ftp://ftp.ncbi.nlm.nih.gov/blast/exe...elease/LATEST/


        • #5
          So, everything is now working. Bruins was correct. By opening the FASTA file in gedit and saving it with Linux line endings, I was able to use the file.


          • #6
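            I see you fixed your problem; however, I don't think you want to open a huge file in an editor if you have a lot of contigs. I also had this problem before and used a perl script to solve it. I converted my file using the s/\r\n/\n/ substitution. Here is a perl script to convert your contig file (it also joins each sequence onto a single line and prints the result to standard output);

            Code:
            use strict;
            use warnings;

            my $contigfile = $ARGV[0]
              or die "Usage: $0 contigs.fasta > converted.fasta\n";

            open(IN, '<', $contigfile) || die "Can't open $contigfile -- fatal\n";
            my ($seq, $prevhead) = ('','');
            while(<IN>){
              s/\r\n/\n/;                          # DOS line ending -> Unix line ending
              chomp;                               # then drop the newline itself
              $seq .= $_ if(eof(IN) && !/^>/);     # last line of the file: keep its bases
              if (/^>(\S+)/ || eof(IN)){
                my $head=$_;
                if($seq ne ""){
                  print "$prevhead\n$seq\n";       # previous record: header, then sequence on one line
                }
                $prevhead = $head;
                $seq = '';
              }else{
                $seq .= $_;                        # sequence line: append to the current record
              }
            }
            close IN;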
            Boetsie.
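The script above changes the layout as well as the line endings, since it writes every sequence on one line. Where only the line endings should change and the original wrapping should stay, something along the following lines may be enough. It is a minimal sketch rather than a replacement for dos2unix, and the name `crlf_to_lf` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: copy text from one filehandle to another, turning
# DOS (\r\n) line endings into Unix (\n) and leaving everything else alone.
# The file is processed line by line, so its size does not matter.
sub crlf_to_lf {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s/\r\n\z/\n/;
        print {$out} $line;
    }
}

# Run as: perl <this script> <input file> <output file>
if (@ARGV == 2) {
    my ($src, $dst) = @ARGV;
    open(my $in,  '<:raw', $src) or die "Can't open $src: $!\n";
    open(my $out, '>:raw', $dst) or die "Can't write $dst: $!\n";
    crlf_to_lf($in, $out);
    close $in;
    close $out or die "Can't close $dst: $!\n";
}
```

Writing to a second file instead of editing in place keeps the original export untouched in case the conversion turns out not to be the cause. A file that uses bare CR characters with no LF at all would pass through this sketch unchanged.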


            • #7
              Originally posted by boetsie
              I see you fixed your problem, however, I don't think you want to open a huge file if you have a lot of contigs. I also had this problem before and used a perl script to solve it. I converted my file using the s/\r\n/\n/ substitution. Here is a perl script to convert your contig file;
              Boetsie.
              It is simple, and it would be more useful with an output function added. For example, in place of the print statement in that script, the sequence can be wrapped at 50 bases per line:
              my $txt = '';
              for (my $i = 0; $i * 50 < length($seq); $i++) {
                $txt .= substr($seq, $i * 50, 50) . "\n";   # next 50 bases plus a newline
              }
              print $prevhead, "\n";   # at that point the pending header is in $prevhead
              print $txt;
