  • Completely renaming Fasta headers

    This may sound like a trivial question for most folks on this forum. Apologies, I am a newbie.

    I have FASTA sequences from GenBank with unique sequence identifiers. For example, one looks something like this:

    >gi|74026815|gb|DQ107070.1| Feline immunodeficiency virus isolate Ac002pA3 pol protein (pol) gene, partial cds
    AGAGCAGATCCTAACAATCCCTGGAATACCCCTATATTTTGTATAAAGAAGAAATCAGGAAAATGGAGAATGTTAATAGATTTTAGAGAATTGAATGCAAAGACTGAGAAAGGAGCAGAAGTACAGTTAGGATTGCCTCA.....

    I would like to replace the header above, and in principle the headers of all my sequences, with names that are unique to where each sample was isolated, e.g.:

    >Yellowstone
    AGAGCAGATCCTAACAATCCCTGGAATACCCCTATATTTTGTATAAAGAAGAAATCAGGAAAATGGAGAATGTTAATAGATTTTAGAGAATTGAATGCAAAGACTGAGAAAGGAGCAGAAGTACAGTTAGGATTGCCTCA.....

    Is there a script out there that allows one to do this all simultaneously?

    Thank you kindly,
    Nick

  • #2
    AFAIK, GenBank identifiers are not linked to geographical metadata, so unless you already have that data in, e.g., a two-column table (gi, location), no.
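
    If such a table does exist, the renaming can be scripted. A minimal sketch, assuming a tab-separated file with the GenBank accession (e.g. DQ107070.1) in the first column and the new name in the second; the function name rename_by_table and the file names are made up for illustration:

```shell
# Hypothetical helper: rename FASTA headers from a two-column, tab-separated
# table (accession <TAB> new name). Usage: rename_by_table MAP.tsv IN.fa
# Assumes MAP.tsv is not empty.
rename_by_table() {
    awk -F '\t' '
        # First file: load the accession -> new-name table.
        NR == FNR { name[$1] = $2; next }
        # Second file, header lines: extract the accession and look it up.
        /^>/ {
            acc = substr($0, 2)
            # GenBank-style header: >gi|74026815|gb|DQ107070.1| description
            if (split(acc, part, "|") >= 4) acc = part[4]
            else sub(/[ \t].*/, "", acc)   # plain header: keep the first word
            if (acc in name) { print ">" name[acc]; next }
        }
        # Sequence lines and unmatched headers pass through unchanged.
        { print }
    ' "$1" "$2"
}
```

    For example: rename_by_table map.tsv INFILE.fa > OUTFILE.fa. Headers whose accession is not in the table are left as they are, so nothing is silently mislabeled.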
    Last edited by rhinoceros; 09-21-2013, 02:20 AM.
    savetherhino.org


    • #3
      Not sure if I understand, but if you already have separate fasta files that need renaming, something like this would do it:

      Code:
      sed 's/>.*/>Yellowstone/' INFILE.fa > OUTFILE.fa


      • #4
        Thank you. To be clearer, I have one .fasta file with multiple sequence alignments. I want to rename all the headers with the names of certain geographical localities, depending on the isolate. Now that I am writing this, it doesn't seem possible.

        Kindly,
        Nick


        • #5
          Do you have the geographical data in some format?


          • #6
            This can be a real nuisance at times.
            I am facing the same problem: I have a 7.7 GB FQ file and want to rename its headers completely.
            Please suggest a way.


            • #7
              From what to what?


              • #8
                I have Illumina reads with names like
                >FCC1047ACXX:1:1101:1991:2224#GTTCGACA/1 1 1
                >FCC1047ACXX:1:1101:1991:2224#GTTCGACA/2 1 1
                I want to rename all of these to "sequence 1".
                I am aligning these reads to a genome using MUMmer, but it reports the error
                Duplicate read....ignored.

                Please suggest a fix.
                Thanks in advance.


                • #9
                  Are you sure you have a fastq file? IDs for a fastq should start with an '@' character. Yours appear to start with a '>', which is the format for fasta files.

                  I'm assuming you want the headers numbered sequentially rather than all changed to "sequence 1", since they would then all presumably count as duplicate reads?

                  These are all doable with relatively simple commands (particularly sed and/or awk), but we need to know exactly what you're trying to change, and into what, before we can suggest any. Could you give us a sample of what your data looks like now and how you want it to look?
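
                  For the sequential-numbering case, one possible awk sketch is below. It assumes a plain FASTA file (headers starting with '>'), and number_headers is a made-up name. It joins prefix and number with an underscore because many tools read the identifier only up to the first whitespace, so "sequence 1" and "sequence 2" could both end up as "sequence":

```shell
# Hypothetical helper: replace every FASTA header with ">PREFIX_N",
# where N counts headers from 1. Usage: number_headers PREFIX < IN.fa > OUT.fa
number_headers() {
    awk -v prefix="$1" '
        # Header line: bump the counter and emit the new name.
        /^>/ { n++; print ">" prefix "_" n; next }
        # Sequence line: pass through unchanged.
        { print }
    '
}
```

                  For example: number_headers sequence < reads.fasta > renamed.fasta. The original read names are discarded, so keep the input file if you may need to trace reads back.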


                  • #10
                    Yes Jamie, you are right.
                    I have FASTA sequences, and they look like this:

                    >FCC1047ACXX:1:1101:1991:2224#GTTCGACA/1 1 1
                    AGAGCAGATCCTAACAATCCCTGGAATACCCCTATATTT
                    >FCC1047ACXX:1:1101:1991:2224#GTTCGACA/2 1 1
                    GAAATCAGGAAAATGGAGAATGTTAATAGATTTTAGAGAA

                    and I want to rename them like this:

                    > sequence 1
                    AGAGCAGATCCTAACAATCCCTGGAATACCCCTATATTT

                    > sequence 2
                    GAAATCAGGAAAATGGAGAATGTTAATAGATTTTAGAGAA


                    • #11
                      You can do that with BBTools.

                      bbrename.sh in=reads.fasta out=renamed.fasta prefix=sequence

                      But note that your reads are paired and interleaved, so I suggest not renaming them "sequence_1" then "sequence_2", but rather "sequence_1 /1" and "sequence_1 /2", then "sequence_2 /1" and "sequence_2 /2", etc., which keeps the pairing information for downstream programs to use. To do that, just tell the tool that the reads are interleaved, like this:

                      bbrename.sh in=reads.fasta out=renamed.fasta prefix=sequence int=t
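
                      For illustration only, the pair-preserving naming scheme described above can also be sketched in plain awk. This is not how bbrename.sh is implemented, and its exact output may differ; the sketch assumes a strictly interleaved FASTA file (read 1, read 2, read 1, ...), and number_pairs is a made-up name:

```shell
# Hypothetical helper: name interleaved pairs ">PREFIX_N /1" and ">PREFIX_N /2".
# Assumes records strictly alternate: read 1, read 2, read 1, read 2, ...
# Usage: number_pairs PREFIX < IN.fa > OUT.fa
number_pairs() {
    awk -v prefix="$1" '
        /^>/ {
            h++                        # running count of header lines
            pair = int((h + 1) / 2)    # headers 1,2 -> pair 1; 3,4 -> pair 2
            mate = (h % 2) ? 1 : 2     # odd header = read 1, even = read 2
            print ">" prefix "_" pair " /" mate
            next
        }
        # Sequence line: pass through unchanged.
        { print }
    '
}
```

                      If the file is not strictly interleaved (for example, some mates were filtered out), this numbering will pair the wrong reads, so check that first.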


                      • #12
                        Hello Brian,

                        Could you also share the link for BBTools?

                        Thanks


                        • #13
                          Certainly; it's here:

                          Download BBMap for free. BBMap short read aligner, and other bioinformatic tools. This package includes BBMap, a short read aligner, as well as various other bioinformatic tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).
