  • Compare fasta files

    Hi there,

    I want to compare several fasta files containing sequences. These sequences are transcripts obtained from RNA-Seq. I want to find out the shared transcripts between samples.
    I cannot use CuffCompare or similar because I have no reference genome. I only have transcripts.

    Thanks in advance,

  • #2
    You could try CD-HIT to cluster the reads.


    • #3
      Try using the command-line utilities cat, sort, and uniq.

      Example:
      #get unique reads for 1, filter out read names (lines with >)
      cat 1.fa | grep -v ">" | sort | uniq > 1.tmp
      #get unique reads for 2
      cat 2.fa | grep -v ">" | sort | uniq > 2.tmp
      #get reads common to 1 and 2
      cat 1.tmp 2.tmp | sort | uniq -d


      For large data files, sort accepts an option to use more RAM (in GNU sort, -S / --buffer-size).
      Check the manual with "man sort" for details.


      • #4
        BLAT appears to be the easiest and most straightforward way, right?


        • #5
          Check out bl2seq ...


          There's a command-line version if you're into that kind of stuff.


          • #6
            Thanks to all, I am very grateful for your help,

            This is my opinion:

            (i) CD-HIT seems interesting, but I have not tested it yet.

            (ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which can span multiple lines).

            (iii) 'BLAT' needs a reference genome, and I do not have one.

            (iv) 'bl2seq' does not support large files, and BLAST+ is suggested instead, so it amounts to running a local BLAST with BLAST+.

            Is this correct?


            • #7
              Originally posted by Hel:
              (iii) 'BLAT' needs a reference genome, and I do not have one.
              BLAT does not need a reference genome. In fact, you run BLAT with just two files, each of which can be a single sequence or a multi-FASTA file. The first file on the command line serves as the "database" and the second as the "query". So in your case you would BLAT each sequence (many of them, one after another) against one "database" file (or all of the files concatenated together). Ideally, the sequence itself will be the top hit. You may want to use a tabular output format so that the results are easy to parse.
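
A minimal sketch of the setup described above, assuming BLAT is installed and on the PATH. The file names, the tiny demo sequences, and the step that prefixes each header with its sample name are illustrative assumptions, not part of the original advice. The `-out=blast8` option asks BLAT for BLAST-style tabular output; the default PSL output is also tab-separated.

```shell
#!/usr/bin/env bash
# Sketch only: file names and demo sequences below are hypothetical.
workdir=$(mktemp -d)

# Two tiny demo multi-FASTA files standing in for per-sample transcript sets.
printf '>t1\nACGTACGTACGTACGTACGTACGTACGT\n>t2\nGGGCCCGGGCCCGGGCCCGGGCCCGGGC\n' > "$workdir/sample1.fa"
printf '>t1\nACGTACGTACGTACGTACGTACGTACGT\n' > "$workdir/sample2.fa"

# Concatenate all samples into one "database" file. Prefixing each header
# with the sample name (an assumption of this sketch) makes hits traceable.
for f in "$workdir"/sample1.fa "$workdir"/sample2.fa; do
  sample=$(basename "$f" .fa)
  sed "s/^>/>${sample}|/" "$f"
done > "$workdir/all_samples.fa"

# Database first, query second, output file last.
if command -v blat >/dev/null 2>&1; then
  blat "$workdir/all_samples.fa" "$workdir/sample1.fa" -out=blast8 \
    "$workdir/sample1_vs_all.tsv" || echo "blat run failed" >&2
else
  echo "blat not found on PATH; skipping the alignment step" >&2
fi
```

Each query transcript should then hit itself in the database, and any additional hits to records from other samples are candidates for shared transcripts, subject to whatever identity and coverage thresholds you choose.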


              • #8
                Originally posted by Hel:

                (ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which can span multiple lines).
                You could remove the line breaks within the sequences and then continue as Richard advised.

                Code:
                awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' file.fa > out.fa
                Last edited by rhinoceros; 05-18-2015, 04:04 AM.
                savetherhino.org
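
Putting the two suggestions together, the following is one possible sketch: join each record's sequence lines into a single line, drop the headers, and then apply the sort/uniq approach from post #3. The helper name `unique_seqs`, the file names, and the demo sequences are hypothetical. Note that this only finds sequences that are exactly identical (same strand, same case, same length) in both files.

```shell
#!/usr/bin/env bash
# Sketch only: demo inputs are hypothetical multi-line FASTA files.
workdir=$(mktemp -d)
printf '>a1\nACGT\nTTGA\n>a2\nGGGC\nCC\n' > "$workdir/1.fa"
printf '>b1\nACGTTT\nGA\n>b2\nTTTT\n' > "$workdir/2.fa"

# Print each record's sequence on a single line (headers dropped),
# then keep one copy of each distinct sequence.
unique_seqs() {
  awk '/^>/ { if (seq != "") print seq; seq = ""; next }
       { seq = seq $0 }
       END { if (seq != "") print seq }' "$1" | sort -u
}

unique_seqs "$workdir/1.fa" > "$workdir/1.tmp"
unique_seqs "$workdir/2.fa" > "$workdir/2.tmp"

# Each .tmp file holds unique sequences, so any line that appears twice
# in the merged, sorted stream is present in both files.
shared=$(sort "$workdir/1.tmp" "$workdir/2.tmp" | uniq -d)
echo "$shared"
```

Transcripts that differ by even one base, or that are assembled on opposite strands, will not be reported by this approach; clustering or alignment tools such as those mentioned earlier in the thread are better suited to near-identical matches.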
