  • File Conversion / Usage with Windows and linux

    Hi All,
    For the past few months I have been using CLC Workbench on Windows 7 to do de novo assembly of transcriptome sequences. Now, being happy with my contigs, I want to move on to other tasks that are primarily done in Linux. I have had my computer reformatted as a dual-boot system to run Ubuntu, but I have run into the problem that the FASTA file generated in CLC (Windows) will not work in Linux.
    For example, I am unable to BLAST my FASTA contig file against other databases. If I run formatdb I get an error message saying the file can't be opened. I am guessing this has to do with the differences between DOS and Unix file formatting. I tried dos2unix commands in Linux but I still cannot use the file. Does anyone have a solution whereby I can make my FASTA file from Windows usable for a BLAST in Linux? As a last resort I can switch CLC to Linux, but this would require me to change from Ubuntu to Red Hat, and I have just finished installing quite a bit of software in Ubuntu. I should say that I am completely new to Linux.
    Thanks!

  • #2
    Hi,

    I noticed this post because someone in my office had a similar problem (miRNAkey on biolinux, basically Ubuntu, refused to open fasta files created by CLC on Windows).

    So the problem might be with the end-of-line character used, with the text encoding, or with a mismatch between the way CLC writes a FASTA file and the way formatdb reads it. In theory, dos2unix should take care of the first point. You could also try opening the file in gedit (or another Linux text editor). I know that TextWrangler on my OS X machine has an option to show 'hidden characters'; I don't know whether gedit does. You could try opening the file in gedit and then saving it explicitly in the correct encoding (UTF-8, I think).
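A quick way to test the end-of-line theory without an editor is to count the carriage-return characters in the file and, if there are any, write a copy without them. This is only a minimal sketch: the function names `count_cr` and `strip_cr` and the demo file are made up for illustration, and it only uses the standard `tr` and `wc` tools.

```shell
#!/bin/sh
# Count carriage-return bytes in a file (each DOS "\r\n" line ending contributes one).
count_cr() {
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

# Write a copy of the file ($1) to a new path ($2) with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Demo on a small, made-up FASTA file with Windows line endings.
tmpdir=$(mktemp -d)
printf '>contig_1\r\nACGTACGT\r\nTTGGCCAA\r\n' > "$tmpdir/dos.fa"
echo "CRs before: $(count_cr "$tmpdir/dos.fa")"
strip_cr "$tmpdir/dos.fa" "$tmpdir/unix.fa"
echo "CRs after:  $(count_cr "$tmpdir/unix.fa")"
rm -r "$tmpdir"
```

If the count is already zero, line endings are probably not the cause and the encoding or the file layout is worth checking next.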

    In the end, the problem with miRNAkey had nothing to do with newlines or encoding. It expected each sequence to be on a single line (which makes sense, they're miRNAs). CLC, on the other hand, used a more 'standard' way of writing FASTA files: it adds a newline after a fixed number of bases (75? 60? I don't know). However, I doubt that formatdb has a problem with this. Check the docs to be safe.
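If a tool does insist on one line per sequence, the wrapped lines can be joined with a short awk filter. The sketch below is an assumption about what such a tool wants (a header line followed by exactly one sequence line); `unwrap_fasta` and the demo file are hypothetical names, not part of CLC, miRNAkey or BLAST.

```shell
#!/bin/sh
# Print a FASTA file ($1) with each record as a header line plus one sequence line.
unwrap_fasta() {
    awk '
        { sub(/\r$/, "") }                       # drop a trailing carriage return, if any
        /^>/ { if (seq != "") print seq          # header: flush the previous sequence
               print; seq = ""; next }
        { seq = seq $0 }                         # sequence line: append to the current record
        END { if (seq != "") print seq }         # flush the last sequence
    ' "$1"
}

# Demo on a small, made-up wrapped FASTA file.
tmpdir=$(mktemp -d)
printf '>contig_1\nACGT\nTTGG\n>contig_2\nCCAA\n' > "$tmpdir/wrapped.fa"
unwrap_fasta "$tmpdir/wrapped.fa"
rm -r "$tmpdir"
```

Because awk reads the file line by line, this should also be usable on large contig files that are awkward to open in an editor.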

    Another problem this person faced was the way the files were copied. He used VMware instead of a dual boot, basically running Linux inside Windows. When he copied the files directly from Windows to Linux, he couldn't open them. When he copied them to a USB drive, remounted the drive in Linux and then copied them, everything was fine.

    Hope that helps,
    cheers



    • #3
      Originally posted by Wiseone
      Does anyone have a solution whereby I can make my FASTA file from Windows usable for a BLAST in Linux?
      Thanks!

      Could you share your FASTA file with us? The problem looks interesting to me (I'm one of the UGENE developers), and I think that by investigating it I will be able both to test the tool and to help you solve your problem.
      ---
      http://ugene.unipro.ru



      • #4
        Just to rule out any issues that aren't OS-specific, is the file compatible with the Windows version of formatdb?

        ftp://ftp.ncbi.nlm.nih.gov/blast/exe...elease/LATEST/



        • #5
          So, everything is now working. Bruins was correct. By opening the FASTA file in gedit and saving it with Linux line endings, I was able to use the file.



          • #6
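            I see you fixed your problem; however, I don't think you want to open a huge file in an editor if you have a lot of contigs. I also had this problem before and used a Perl script to solve it. I converted my file using the s/\r\n/\n/ substitution. Here is a Perl script to convert your contig file; it also joins each sequence onto a single line:

            Code:
            use strict;
            use warnings;

            # The contig FASTA file is the first command-line argument;
            # the converted file is written to standard output.
            my $contigfile = $ARGV[0];

            open(IN, '<', $contigfile) || die "Can't open $contigfile -- fatal\n";
            my ($seq, $prevhead) = ('','');
            while(<IN>){
              s/\r\n/\n/;                # DOS line ending -> Unix line ending
              chomp;
              if (/^>/){                 # header line: print the previous record
                if($seq ne ""){
                  print "$prevhead\n$seq\n";
                }
                $prevhead = $_;
                $seq = '';
              }else{
                $seq .= $_;              # sequence line: join onto one line
              }
            }
            # print the last record, which has no following header to trigger it
            print "$prevhead\n$seq\n" if($seq ne "");
            close IN;
            Boetsie.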



            • #7
              Originally posted by boetsie
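              It is simple, and it would be more useful with an output function that wraps each sequence at 50 bases per line instead of printing it on one line:

              Code:
              my $txt = '';
              for (my $i = 0; $i * 50 < length($seq); $i++){
                $txt .= substr($seq, $i * 50, 50)."\n";   # next chunk of up to 50 bases
              }
              # $prevhead is the header that belongs to $seq in the script above
              print $prevhead,"\n";
              print $txt;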
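The same fixed-width wrapping can be sketched as a stand-alone awk filter, for anyone who would rather not edit the Perl script. The function name `wrap_fasta`, its optional width argument and the demo file are assumptions made for this illustration; the default of 50 simply follows the snippet above.

```shell
#!/bin/sh
# Print a FASTA file ($1) with sequence lines wrapped at a fixed width ($2, default 50).
wrap_fasta() {
    width="${2:-50}"
    awk -v w="$width" '
        /^>/ { print; next }                     # header lines pass through unchanged
        { for (i = 1; i <= length($0); i += w)   # cut each sequence line into chunks
              print substr($0, i, w) }
    ' "$1"
}

# Demo on a small, made-up FASTA file, wrapped at 5 bases per line.
tmpdir=$(mktemp -d)
printf '>contig_1\nACGTACGTACGT\n' > "$tmpdir/oneline.fa"
wrap_fasta "$tmpdir/oneline.fa" 5
rm -r "$tmpdir"
```

Note that this wraps each input line separately, so it gives even line lengths only when every sequence is already on a single line.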
