
  • Concatenating Sequences Within a single Fasta File

    Hi all,
    I need to concatenate a bunch of sequences in a FASTA file. I have a file of extracted introns and would like to essentially splice them all together for use in a program. Is there any way to do either of the following using Perl (preferably) or Python (if necessary)?

    1. Join all the introns of a single gene, preserving the FASTA heading for that gene.

    > Gene 1 Intron 1
    GTACGCC....CTGATAGAG
    >Gene 1 Intron 2
    GTCCAGGAC.....CTGAGTAAG

    Becomes
    > Gene 1 Intron 1
    GTACGCC....CTGATAGAGGTCCAGGAC.....CTGAGTAAG

    or
    2. Join a number of introns together (not accounting for what genes they came from) under a non-specific FASTA formatted heading?

    Basically I want to splice together a bunch of intron sequences like they were exons so that I can run them through a program that doesn't like how short they are. The first way would be the most biologically relevant and useful for my purposes, but if it can't be done I can live with it haha. Any help would be greatly appreciated. Thanks a lot!

  • #2
    I don't think that there is a program out there that will do what you want. In part this is because FastA headers are not standardized. Basically all you need for FastA is the '>'. So it would be hard to create a general purpose program that would combine FastA sequences together based on some regular yet arbitrary criteria. That said it would be trivial (for a programmer, at least) to create a one-shot specific program that would do the combining.

    Now that I think of it, such a program would be a good one to give to a beginning programmer: straightforward, yet not so trivial as to be boring.
    Last edited by westerman; 12-04-2012, 07:54 AM. Reason: Added words, "one-shot specific"
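As a rough illustration of the kind of one-shot program described above, here is a minimal Python sketch for option 1. It assumes every header looks like "Gene <id> Intron <n>" (as in the example in the first post) and that introns should be joined in file order; the function names `read_fasta`, `gene_key`, and `join_introns_by_gene` are made up for this example. Headers in another layout would need a different `gene_key`.

```python
import re

def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    # Split on '>' at the start of a line; anything before the first '>' is dropped.
    for block in re.split(r"^>", text, flags=re.MULTILINE)[1:]:
        lines = block.splitlines()
        if not lines:
            continue
        header = lines[0].strip()
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((header, sequence))
    return records

def gene_key(header):
    """Strip a trailing 'Intron <n>' so introns of one gene share a key.

    ASSUMPTION: headers follow the 'Gene <id> Intron <n>' layout.
    Headers that do not match are used unchanged as their own key.
    """
    match = re.match(r"(.*?)\s+Intron\s+\d+\s*$", header, flags=re.IGNORECASE)
    return match.group(1) if match else header

def join_introns_by_gene(text):
    """Concatenate introns per gene, keeping the first header seen for each gene."""
    merged = {}  # gene key -> [first header, list of sequences]; dicts keep insertion order
    for header, sequence in read_fasta(text):
        entry = merged.setdefault(gene_key(header), [header, []])
        entry[1].append(sequence)
    return [(first_header, "".join(parts)) for first_header, parts in merged.values()]

if __name__ == "__main__":
    example = (
        "> Gene 1 Intron 1\nGTAC\nCTGAG\n"
        ">Gene 1 Intron 2\nGTCCAAG\n"
        ">Gene 2 Intron 1\nGTAAG\n"
    )
    for header, sequence in join_introns_by_gene(example):
        print(">" + header)
        print(sequence)
```

Because FASTA headers are not standardized, only `gene_key` should need changing for a differently formatted file.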


    • #3
      A little fiddly but you can use the text manipulation tool in Galaxy.

      Convert your FASTA files to tabular format, then use the cut-column and paste-files-side-by-side tools to get your sequences into the same file. You can then merge the columns and convert back to FASTA.
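For anyone who wants the tabular round trip outside Galaxy, the conversion itself can be sketched in a few lines of Python. This only illustrates the idea (one "header<tab>sequence" row per record); it is not what Galaxy runs internally, and the function names `fasta_to_tabular` and `tabular_to_fasta` are hypothetical.

```python
def fasta_to_tabular(lines):
    """Turn FASTA lines into 'header<TAB>sequence' rows, one row per record."""
    rows = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                rows.append(header + "\t" + "".join(chunks))
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        rows.append(header + "\t" + "".join(chunks))
    return rows

def tabular_to_fasta(rows, width=60):
    """Turn 'header<TAB>sequence' rows back into FASTA lines wrapped at `width`."""
    out = []
    for row in rows:
        header, sequence = row.split("\t", 1)
        out.append(">" + header)
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return out
```

The merging step in between still has to decide which rows belong together, which is where the header layout matters.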


      • #4
        Jackie: I'll agree that conversion to tabular is a good first step -- and this can be done via the command line as well using the FastX tools -- but I don't see how a merger could be done automatically. Let's say after the conversion I have

        Gene 1 Intron 1 <tab> GTACGCC....CTGATAGAG
        Gene 1 Intron 2 <tab> GTCCAGGAC.....CTGAGTAAG
        Gene 2 Intron 1 <tab> sequence
        Gene 3 Intron 1 <tab> sequence
        Gene 3 Intron 2 <tab> sequence
        Gene 3 Intron 3 <tab> sequence
        etc.

        How to pull all of the Gene1, Gene2, Gene3, etc. sequences together without some programming? I'm not familiar enough with Galaxy to follow your "cut column and paste files" suggestion but if it is anything like the normal Unix 'cut' and 'paste' commands then I just don't see how to do the cut'n'paste automatically.

        Perhaps if the tabular file becomes:

        Gene<tab>1<tab>Intron<tab>1<tab>Sequence
        Gene<tab>1<tab>Intron<tab>2<tab>Sequence
        etc.

        Then maybe cut'n'paste would work.
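If the tabular file does have the five-column layout above, the grouping step is short to script. The sketch below assumes tab-separated rows of the form Gene, id, Intron, number, sequence, and that all rows for a gene are adjacent (sort the file first otherwise, since `itertools.groupby` only merges consecutive rows). The function name `merge_tabular_by_gene` is hypothetical, and the merged header is simply rebuilt from the first two columns rather than copied from the first intron.

```python
from itertools import groupby

def merge_tabular_by_gene(rows):
    """Merge rows like 'Gene<TAB>1<TAB>Intron<TAB>2<TAB>SEQ' into one record per gene.

    ASSUMPTION: columns 1-2 identify the gene, the last column is the sequence,
    and rows belonging to the same gene are adjacent in the input.
    """
    parsed = (row.rstrip("\n").split("\t") for row in rows if row.strip())
    merged = []
    for (label, gene_id), group in groupby(parsed, key=lambda fields: (fields[0], fields[1])):
        sequence = "".join(fields[-1] for fields in group)
        merged.append((f"{label} {gene_id}", sequence))
    return merged

if __name__ == "__main__":
    table = [
        "Gene\t1\tIntron\t1\tGTAC",
        "Gene\t1\tIntron\t2\tCCAG",
        "Gene\t2\tIntron\t1\tGTTT",
    ]
    for header, sequence in merge_tabular_by_gene(table):
        print(">" + header)
        print(sequence)
```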
