  • nucleotide sequence extraction

    I wish to extract part of a sequence from a particular sequence/scaffold ID, for example bases 437 to 959 of a 3 Mb scaffold.

    I am more familiar with grep and have used it before like this:
    grep -A 1 scaffoldID sequencefasta.fa > saveoutput.fa

    but I don't know how to extract a particular part of the sequence.

    Could anyone help me with this, please?

    S

  • #2
    Have a look at Galaxy.

    Alternatively you can use Biopieces like this:

    Code:
    read_fasta -i input.fasta |
    grab -p scaffoldID -k SEQ_NAME |
    extract_seq -b 437 -e 959 |
    write_fasta -o output.fasta -x

    Martin

    • #3
      You could also use bedtools (code.google.com/p/bedtools/). I've used this tool to extract sub-sequence data before and I really like it because it's fast and efficient.

      The tool in bedtools is called fastaFromBed (creates FASTA sequences based on intervals in a BED/GFF/VCF file). It extracts sub-regions of a FASTA file that you specify in a BED file.

      The manual is here: http://code.google.com/p/bedtools/do...-Manual.v3.pdf

      Example of the command from the manual:

      fastaFromBed [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>
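
For the region asked about in the first post, a minimal sketch of the BED step could look like the following. The names `input.fasta`, `region.bed`, `output.fasta` and `scaffoldID` are placeholders, not taken from the thread. It also assumes a newer bedtools release, where the same tool is called as `bedtools getfasta`. BED starts are 0-based and BED ends are exclusive, so bases 437 to 959 become 436 and 959.

```shell
#!/bin/sh
# Sketch: turn a 1-based, inclusive region into a BED line and pass it to bedtools.
# All file names and the scaffold ID are placeholders.
id=scaffoldID
start=437   # 1-based, inclusive
end=959     # 1-based, inclusive

# BED starts are 0-based and BED ends are exclusive,
# so only the start coordinate shifts by one.
printf '%s\t%d\t%d\n' "$id" "$((start - 1))" "$end" > region.bed

# Run the extraction only if bedtools and the input FASTA are actually present.
if command -v bedtools >/dev/null 2>&1 && [ -f input.fasta ]; then
    bedtools getfasta -fi input.fasta -bed region.bed -fo output.fasta
fi
```

The first column of the BED line has to match the FASTA header ID exactly, otherwise no sequence is written.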

      • #4
        Thanks Maasha and NextGenGirl,
        I could not install these tools on my system. The scaffold name and the sequence ID are the same. Could you please suggest a solution that uses only Perl or grep-like commands? I am using biolinux.
        Regards,
        S

        • #5
          I assume you are perhaps missing a compiler (gcc)/libraries when you say that you could not install these tools.

          Are you using a "live" image of biolinux to temporarily boot into a unix environment or are you using someone else's biolinux machine?


          • #6
            Thanks for your message.

            This is on my own machine, running in VMware. I guess I can install these using sudo. It was less that I 'could not' and more that I was afraid or sceptical about installing these tools: if anything gets messy, I don't have the know-how to fix it. So I don't want to tinker with my standard installation.
            Regards,
            S

            • #7
              Give it a try. This is something you need to learn if you are planning to keep using *nix in some form.

              I doubt that you can cause major damage by installing bedtools ... but if you did manage to do that then perhaps you should not be using *nix in the first place.

              I have not used VMware lately. Are there any tools that let you back up the image, so that if something does go wrong you can revert to the old image?

              Last edited by GenoMax; 05-16-2012, 09:06 AM.

              • #8
                Although my username reflects my level of knowledge, with your encouragement I shall give it a try sometime later.
                Regards,
                S

                • #9
                  Hi struggler,

                  I agree with GenoMax. Try to install these tools. Otherwise, if you are concerned about that, maasha's suggestion of Galaxy is also good. They have a tool there under Fetch Sequences called "Extract Genomic DNA", which is the tool I used before I learned how to use unix.

                  • #10
                    EMBOSS (http://emboss.sourceforge.net/) is probably the most useful package for basic sequence manipulation/analysis.

                    Note that to use stdin/stdout you need to pass the '-filter' flag, and that the '-auto' flag disables the parameter prompting. The manual on their website is very informative.

                    I hope this helps!
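
The post above does not name a specific EMBOSS program. One candidate for this task is `extractseq`, so the sketch below is an assumption about which tool was meant, not the poster's own command. The file names and `scaffoldID` are placeholders. The `file:entry` form selects a single record from a multi-record FASTA file.

```shell
#!/bin/sh
# Sketch: extract bases 437-959 of one record with EMBOSS extractseq.
# The choice of extractseq, the file names and the ID are assumptions.
id=scaffoldID
start=437
end=959
region="${start}-${end}"   # extractseq takes 1-based, inclusive regions

# Run only if EMBOSS and the input FASTA are actually present.
if command -v extractseq >/dev/null 2>&1 && [ -f input.fasta ]; then
    extractseq -sequence "input.fasta:$id" -regions "$region" \
               -outseq output.fasta -auto
fi
```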

                    • #11
                      @struggler .. try this

                      #fasta file: pa101.fasta
                      >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
                      QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
                      KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
                      VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
                      FLFLIKHNPTNTIVYFGRYWSP
                      #script: sequence_extractor.sh
                      #!/bin/bash

                      # The 1-based sequence extractor - sequence_extractor.sh
                      # Works on a FASTA file that holds a single record.
                      # No guarantees offered.

                      # usage:
                      # 1) download the script or copy the contents
                      # of the script and save it as sequence_extractor.sh
                      # 2) make it executable: chmod 755 sequence_extractor.sh
                      # 3) run the script with the file, start and end given
                      # on the command line: ./sequence_extractor.sh pa101.fasta 4 6

                      # skip the header (line 1) and merge the sequence lines;
                      # the input file itself is left untouched
                      temp_var1=$(awk 'NR > 1 { printf "%s", $0 }' "$1") || exit $?

                      # length of the region (end - start + 1)
                      temp_var2=$(( $3 - $2 + 1 ))

                      # display the extracted sequence (bash substrings are 0-based)
                      echo "${temp_var1:$(($2 - 1)):$temp_var2}"
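
For a FASTA file with several records, a variant that picks one record by ID and needs nothing beyond awk might look like this. It is a minimal sketch: `extract_region` is a hypothetical helper name, the demo records are made up, and the ID is assumed to be the first word of the header line. Coordinates are 1-based and inclusive, as in the first post.

```shell
#!/bin/sh
# Sketch: print bases START..END (1-based, inclusive) of one FASTA record with awk.
# extract_region is a hypothetical helper name, not part of any toolkit.
extract_region() {
    # $1 = FASTA file, $2 = record ID, $3 = start, $4 = end
    awk -v id="$2" -v start="$3" -v end="$4" '
        /^>/ {
            # The record ID is the first word of the header, without ">".
            keep = (substr($1, 2) == id)
            next
        }
        keep { seq = seq $0 }   # join the wrapped sequence lines
        END {
            if (seq == "") {
                print "record not found: " id > "/dev/stderr"
                exit 1
            }
            printf ">%s:%d-%d\n%s\n", id, start, end, substr(seq, start, end - start + 1)
        }
    ' "$1"
}

# Demo on a small made-up file with two records.
demo=$(mktemp)
printf '>scaffold1 demo record\nACGTAC\nGTTTGA\n>scaffold2\nCCCCCC\n' > "$demo"
extract_region "$demo" scaffold1 4 9
# prints:
# >scaffold1:4-9
# TACGTT
```

For the original question the call would be along the lines of `extract_region sequencefasta.fa scaffoldID 437 959 > saveoutput.fa`. The input file is only read, never modified.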

                      • #12
                        From the NCBI toolkit, formatdb and fastacmd work nicely.

                        First format your sequence file:

                        formatdb -i <fasta sequence file> -p F -o T


                        This creates a blastable sequence db (a useful bonus). The "-o T" flag indexes the sequence IDs so that fastacmd can look records up by ID.

                        Then:

                        fastacmd -d <fasta sequence file> -o <output file name> -p F -s <ID of record you want to retrieve> -L <start position,end_position>

                        fastacmd can also retrieve many records at once. See the documentation.
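
formatdb and fastacmd belong to the legacy NCBI toolkit. On systems that ship BLAST+ instead, the corresponding programs are `makeblastdb` and `blastdbcmd`. The sketch below assumes BLAST+ is installed; the file names and `scaffoldID` are placeholders.

```shell
#!/bin/sh
# Sketch: the same two steps with BLAST+ (makeblastdb / blastdbcmd).
# File names and the record ID are placeholders.
id=scaffoldID
start=437
end=959
range="${start}-${end}"   # blastdbcmd ranges are 1-based, inclusive

# Run only if BLAST+ and the input FASTA are actually present.
if command -v makeblastdb >/dev/null 2>&1 &&
   command -v blastdbcmd >/dev/null 2>&1 &&
   [ -f input.fasta ]; then
    # -parse_seqids indexes the IDs so that records can be fetched by name.
    makeblastdb -in input.fasta -dbtype nucl -parse_seqids
    blastdbcmd -db input.fasta -entry "$id" -range "$range" -out output.fasta
fi
```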

                        • #13
                          Dear Mark,
                          Many many thanks! The fastacmd command worked like a bullet!!

                          I am also thankful to all others for their helpful suggestions.

                          Regards,
                          S
