Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract multiple fasta sequences from a fasta file based on sequenes

    I've searched in the forum and google as well, but most of cases are extract sequences from fasta file based on id list. but now I only have several sequences without ids and i want to extract from a fasta file based on these sequences, does anyone have clues for it? Actually, that means I need the sequences identifier (or id) from the fasta file.

    Thank you very much for your help.

  • #2
    You can extract sequences that share kmers with your sequences with BBDuk:

    Code:
    bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
    This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

    You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.

    Comment


    • #3
      There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

      In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

      It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

      I'm sure some folks might have some grep based methods for the commandline as well

      Comment


      • #4
        3xs for the info.


        Originally posted by Brian Bushnell View Post
        You can extract sequences that share kmers with your sequences with BBDuk:

        Code:
        bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
        This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

        You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.

        Comment


        • #5
          Actually, I do prefer command based under shell environment. grep seems a great way to do it, thanks for tips.


          Originally posted by cmbetts View Post
          There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

          In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

          It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

          I'm sure some folks might have some grep based methods for the commandline as well

          Comment


          • #6
            Here are my Python scripts to do this, with Galaxy wrappers:

            This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
            Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


            This indexes the FASTA file, then goes through the IDs one by one:
            Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


            The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).

            Comment


            • #7
              Seems these script also deal with a list of ids, then using these ids to fetch sequences in a fasta file. I wanna use sequences directly and get a subset of fasta from a big fasta file base on provided sequences. Thank you for your information as well.


              Originally posted by maubp View Post
              Here are my Python scripts to do this, with Galaxy wrappers:

              This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
              Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


              This indexes the FASTA file, then goes through the IDs one by one:
              Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


              The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).

              Comment


              • #8
                What scripting/programming language(s) are you learning?

                Comment


                • #9
                  Actually, I do more wet lab most of the time. Sometimes, I'll also do simple small rna analysis with ready-to-use perl script and simple shell script. It's enough most of the time with my work, but sometime it's not easy for me.

                  Originally posted by maubp View Post
                  What scripting/programming language(s) are you learning?

                  Comment


                  • #10
                    Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

                    Comment


                    • #11
                      No worry, I'll try to use grep to deal with the problem .
                      Originally posted by maubp View Post
                      Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

                      Comment


                      • #12
                        Brian's solution should work. Did you try it?

                        While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.

                        Comment


                        • #13
                          Yes, I've tried bbduk.sh.

                          bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

                          But my situation is that b.fa is not fasta file, it contain one sequence per line. I just want the sequence in b from a.fa, than make a new fasta file (c.fa).

                          since my b.fa is not a fasta file, so bbduk.sh give some error:

                          Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text.


                          Originally posted by GenoMax View Post
                          Brian's solution should work. Did you try it?

                          While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.

                          Comment


                          • #14
                            If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

                            Code:
                            $ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
                            Then use the file with BBDuk.

                            Comment


                            • #15
                              Thank you for the code. It can change my sequences to fasta file. And I try bbduk.fas again, but the result is not as expected. An example will be more easier to understand. there are two fasta

                              original.fas
                              >1123-11234
                              aaaaaa
                              >wer
                              atgcca
                              >ad
                              ctaacg
                              >232-23424
                              tttttt
                              >323-342
                              cacaaa
                              >416-2
                              gggggg
                              >13424241234-23423
                              cccccc
                              >5-234
                              cggcgtcacgttggttgttga


                              ref.fas(after I make fasta using your awk script)
                              >1
                              aaaaaa
                              >2
                              tttttt
                              >3
                              gggggg
                              >4
                              cccccc

                              I use "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21"

                              out.fas is like this
                              >5-234
                              cggcgtcacgttggttgttga

                              actually, I want a fasta like this

                              >1123-11234
                              aaaaaa
                              >232-23424
                              tttttt
                              >416-2
                              gggggg
                              >13424241234-23423
                              cccccc

                              Just like fetch the id from the original.fas


                              Originally posted by GenoMax View Post
                              If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

                              Code:
                              $ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
                              Then use the file with BBDuk.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X