
  • Compare fasta files

    Hi there,

    I want to compare several FASTA files containing transcript sequences obtained from RNA-Seq, in order to find the transcripts shared between samples.
    I cannot use CuffCompare or similar tools because I have no reference genome; I only have the transcripts.

    Thanks in advance,

  • #2
    You could try CD-HIT to cluster the reads.
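A rough sketch of how CD-HIT could be used for this, assuming the nucleotide program cd-hit-est and assuming each header is first prefixed with a sample label so that cluster members can be traced back to their file. The helper names (tag_sample, shared_clusters), the labels s1/s2, the file names and the 0.95 identity threshold are illustrative assumptions, not recommendations from this thread.

```shell
# Prefix every FASTA header in a file with a sample label, e.g. ">t1" -> ">s1|t1".
tag_sample() {
    sed "s/^>/>$1|/" "$2"
}

# Hypothetical run (not executed here): pool the labelled files and cluster
# at 95% identity; -d 0 keeps header names up to the first whitespace.
#   { tag_sample s1 1.fa; tag_sample s2 2.fa; } > all.fa
#   cd-hit-est -i all.fa -o clustered.fa -c 0.95 -d 0

# Print the numbers of the clusters in a .clstr file whose members come
# from more than one sample label.
shared_clusters() {
    awk '
        /^>Cluster/ {
            if (n > 1) print id
            id = $2; n = 0; split("", seen)
            next
        }
        {
            s = $0
            sub(/^[^>]*>/, "", s)   # drop everything up to the member name
            sub(/[|].*/, "", s)     # keep only the sample label
            if (!(s in seen)) { seen[s] = 1; n++ }
        }
        END { if (n > 1) print id }
    ' "$1"
}
```

With the hypothetical run above, shared_clusters clustered.fa.clstr would list the clusters that contain transcripts from at least two samples. Unlike exact matching, this tolerates small differences between assemblies, at the cost of having to choose an identity threshold.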



    • #3
      Try using command line utilities
      cat
      sort
      uniq


      Example:
      # get unique reads for 1, filtering out read names (lines starting with >)
      grep -v "^>" 1.fa | sort -u > 1.tmp
      # get unique reads for 2
      grep -v "^>" 2.fa | sort -u > 2.tmp
      # get reads common to 1 and 2
      cat 1.tmp 2.tmp | sort | uniq -d

      For large files, sort can be given a larger memory buffer (-S in GNU sort).
      Check the manual with "man sort" for details.



      • #4
        BLAT appears to be the easiest and most straightforward way, right?



        • #5
          Check out bl2seq ...


          There's a command-line version if you're into that kind of stuff.



          • #6
            Thanks to all, I am very grateful for your help,

            This is my opinion:

            (i) CD-HIT seems interesting, but I have not tested it yet.

            (ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which can span multiple lines).

            (iii) 'BLAT' needs a reference genome, and I do not have such.

            (iv) 'bl2seq' does not support large files, and BLAST+ is suggested instead. So it is the same as running a local BLAST with BLAST+.

            Is this correct?



            • #7
              Originally posted by Hel
              (iii) 'BLAT' needs a reference genome, and I do not have such.
              BLAT does not need a reference genome. In fact, you run BLAT with just two files, each of which can be a single sequence or a multi-FASTA file. The first file on the command line serves as the "database" and the second as the "query". So in your case you would BLAT each sequence (actually many of them, one after another) against one "database" file (or all of the files concatenated together). Ideally, each sequence's top hit will be the sequence itself. You may want to use a tabular output format so that the results are easy to parse.
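A minimal sketch of that workflow, assuming BLAT's blast8 output type (12 tab-separated columns: query, subject, percent identity, alignment length, and so on). The function names, file names and the thresholds used below are assumptions for illustration only.

```shell
# Align every sequence in the query file ($2) against the database file ($1)
# and write tabular (blast8) results to $3. Requires blat on the PATH.
run_blat() {
    blat "$1" "$2" -out=blast8 "$3"
}

# From a blast8 table ($1), keep hits with percent identity (column 3) of at
# least $2 and alignment length (column 4) of at least $3, and print the
# distinct query/subject name pairs.
filter_hits() {
    awk -F '\t' -v id="$2" -v len="$3" \
        '$3 >= id && $4 >= len { print $1 "\t" $2 }' "$1" | sort -u
}
```

For example, run_blat sample1.fa sample2.fa hits.tsv followed by filter_hits hits.tsv 95 200 would list the transcript pairs that align at 95% identity or better over at least 200 bases. Suitable cutoffs depend on the data.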



              • #8
                Originally posted by Hel

                (ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which can span multiple lines).
                You could remove the line breaks within the sequences and then continue as Richard advised.

                Code:
                awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' file.fa > out.fa
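Putting the two ideas together, here is a hedged sketch that joins wrapped sequence lines and then looks for exact duplicates across two files. The function names and file names are made up for illustration, and it only finds sequences that are identical character for character, so near-identical transcripts from separate assemblies will be missed.

```shell
# Print each distinct sequence of a FASTA file ($1) on its own line:
# headers are dropped and wrapped sequence lines are joined.
fasta_seqs() {
    awk '
        /^>/ { if (seq != "") print seq; seq = ""; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1" | sort -u
}

# Print the sequences that occur in both FASTA files ($1 and $2).
# Each file contributes a sequence at most once, so any line that
# appears twice in the pooled list is present in both files.
shared_seqs() {
    { fasta_seqs "$1"; fasta_seqs "$2"; } | sort | uniq -d
}
```

Usage would be something like shared_seqs 1.fa 2.fa > shared.txt. For more than two samples, the same approach could be applied pairwise.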
                Last edited by rhinoceros; 05-18-2015, 04:04 AM.
                savetherhino.org
