  • #31
    This is irrelevant now that the question is answered, but I can reproduce the "can't find getline..." error mentioned above that resulted from my first script.

    The simple solution is to put "use IO::File;" at the top of the script. Apparently the behavior of Perl's I/O has changed (I'm using 5.22). Also, there are often OS-specific differences with scripting languages because the "perl" (or whatever) that the vendor distributes is not the same as what you compile yourself. So, it's possible there could be other tweaks required based on the system/versions, but it would be kind of a waste to go down that path since the problem is solved. Okay, now it's time to move on back to work...



    • #32
      Only those sequences with a "-" in the name?

      filtentr original 1 - > k1
      [filters sequences (including name lines) with "-" in the first entry]

      names k1 5 5 > result
      [keeps only the raw sequence data, omitting the names]

      This uses two simple self-written programs, filtentr.c and names.c.
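The source of filtentr.c and names.c is not shown in the thread, so the following is only a rough Python sketch of the same two steps, assuming a plain FASTA input: keep the records whose name contains "-", then print only their raw sequences. The function names `parse_fasta` and `sequences_with_dash` are made up for this illustration.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequences_with_dash(lines):
    """Return the raw sequences of records whose name contains '-'."""
    return [seq for name, seq in parse_fasta(lines) if "-" in name]


# Small example in the style of the files posted in this thread.
example = """>1123-11234
aaaaaa
>wer
atgcca
>232-23424
tttttt
""".splitlines()

print("\n".join(sequences_with_dash(example)))
```

The record named "wer" is dropped because its name has no "-"; only the sequence lines of the other two records are printed.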



      • #33
        I admire your enthusiasm for programming; that is what makes an excellent programmer. Different runtime environments cause tons of unexpected problems, which is why robust code is really not easy to produce.

        Originally posted by SES
        This is irrelevant now that the question is answered, but I can reproduce the "can't find getline..." error mentioned above that resulted from my first script.

        The simple solution is to put "use IO::File;" at the top of the script. Apparently the behavior of Perl's I/O has changed (I'm using 5.22). Also, there are often OS-specific differences with scripting languages because the "perl" (or whatever) that the vendor distributes is not the same as what you compile yourself. So, it's possible there could be other tweaks required based on the system/versions, but it would be kind of a waste to go down that path since the problem is solved. Okay, now it's time to move on back to work...



        • #34
          Yes, this time it works, though the output is not as expected.

          more out.fas
          >1123-11234
          --
          gggggg
          >13424241234-23423
          >1123-11234
          aaaaaa
          ctaacg
          >232-23424
          >232-23424
          tttttt
          tttttt
          >323-342
          >416-2
          gggggg
          cacaaa
          >416-2
          >13424241234-23423
          cccccc

          Originally posted by GenoMax
          Are you running bash shell? If you are not then try explicitly going into bash like this

          Code:
          $ /bin/bash
          $ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas



          • #35
            Another good method, thank you.

            Originally posted by gsgs
            Only those sequences with a "-" in the name?

            filtentr original 1 - > k1
            [filters sequences (including name lines) with "-" in the first entry]

            names k1 5 5 > result
            [keeps only the raw sequence data, omitting the names]

            This uses two simple self-written programs, filtentr.c and names.c.



            • #36
              So, I was inspired by this thread to add something into BBTools that could accomplish this. Thus, there's yet another method, "filterbysequence.sh". Usage:

              Code:
              filterbysequence.sh in=a.fasta ref=b.fasta out=c.fasta
              c.fasta will then contain all sequences shared by a.fasta and b.fasta. It supports case-matching or case-insensitive operation, and reverse-complement-aware or forward-only operation. It can do either an exclusion or an inclusion filter. It can also optionally reduce very large sequences down to their 128-bit hash codes for low-memory operation (so, for example, you could quickly filter sequences against nt or RefSeq microbial in a small amount of memory to see if they are already present before adding yet another copy of E. coli, which is something NCBI absolutely needs to do). And it's very, very fast.
              Last edited by Brian Bushnell; 12-19-2015, 10:21 AM.
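To make the matching modes described above a little more concrete, here is a minimal Python sketch of how a comparison key could be built for each sequence. It is not BBTools code: the function and parameter names (`sequence_key`, `ignore_case`, `rcomp`, `hashed`) are hypothetical, and MD5 is used only as a convenient stand-in for "some 128-bit hash", since the post does not say which hash the tool uses.

```python
import hashlib

# Translation table for complementing DNA bases (other characters pass through).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sequence_key(seq, ignore_case=True, rcomp=True, hashed=False):
    """Build the key under which two sequences count as 'the same'.

    ignore_case: compare case-insensitively.
    rcomp:       treat a sequence and its reverse complement as equal by
                 always using the lexicographically smaller of the two.
    hashed:      store a 16-byte (128-bit) digest instead of the full
                 sequence to save memory (assumption: MD5 as a stand-in).
    """
    if ignore_case:
        seq = seq.upper()
    if rcomp:
        seq = min(seq, reverse_complement(seq))
    if hashed:
        return hashlib.md5(seq.encode("ascii")).digest()
    return seq
```

With keys like these, an inclusion filter keeps a sequence when its key is in the set of reference keys, and an exclusion filter keeps it when the key is absent. Hashed keys trade a very small chance of collisions for a fixed 16 bytes per reference sequence.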



              • #37
                Thank you for your great work. I've tried the script with the two FASTA files below.

                a.fasta

                >1
                aaaaaa
                >2
                tttttt
                >3
                gggggg
                >4
                cccccc



                b.fasta

                >1123-11234
                aaaaaa
                >wer
                atgcca
                >ad
                ctaacg
                >232-23424
                tttttt
                >323-342
                cacaaa
                >416-2
                gggggg
                >13424241234-23423
                cccccc
                >5-234
                cggcgtcacgttggttgttga


                running the script
                filterbysequence.sh in=a.fasta ref=b.fas out=c.fasta ow=true

                I got a c.fasta that is exactly the same as a.fasta. Is it supposed to replace the identifiers of the sequences in a.fasta with those from b.fasta?




                Originally posted by Brian Bushnell
                So, I was inspired by this thread to add something into BBTools that could accomplish this. Thus, there's yet another method, "filterbysequence.sh". Usage:

                Code:
                filterbysequence.sh in=a.fasta ref=b.fasta out=c.fasta
                c.fasta will then contain all sequences shared by a.fasta and b.fasta. It supports case-matching or case-insensitive operation, and reverse-complement-aware or forward-only operation. It can do either an exclusion or an inclusion filter. It can also optionally reduce very large sequences down to their 128-bit hash codes for low-memory operation (so, for example, you could quickly filter sequences against nt or RefSeq microbial in a small amount of memory to see if they are already present before adding yet another copy of E. coli, which is something NCBI absolutely needs to do). And it's very, very fast.



                • #38
                  Yep, it keeps the sequences in a.fasta that match sequences in b.fasta, and retains the names. You could alternatively run "filterbysequence.sh in=b.fasta ref=a.fasta out=c.fasta" to keep the names from b.fasta.
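The effect of swapping in= and ref= can be illustrated with a small Python sketch using the records from a.fasta and b.fasta posted above. This only mimics the behaviour described here (exact, forward-only matching on the sequence; names taken from the input side); it is not how BBTools implements it, and `filter_by_sequence` is a name invented for the example.

```python
def filter_by_sequence(records, reference, include=True):
    """Keep (name, seq) records whose sequence occurs in `reference`.

    With include=False the filter is inverted and keeps the records
    whose sequence does NOT occur in `reference`.
    """
    ref_seqs = {seq for _, seq in reference}
    return [(name, seq) for name, seq in records
            if (seq in ref_seqs) == include]


# Records from a.fasta and b.fasta as posted in this thread.
a = [("1", "aaaaaa"), ("2", "tttttt"), ("3", "gggggg"), ("4", "cccccc")]
b = [("1123-11234", "aaaaaa"), ("wer", "atgcca"), ("ad", "ctaacg"),
     ("232-23424", "tttttt"), ("323-342", "cacaaa"), ("416-2", "gggggg"),
     ("13424241234-23423", "cccccc"), ("5-234", "cggcgtcacgttggttgttga")]

# in=a ref=b: every sequence of a occurs in b, so the result equals a.
names_from_a = [name for name, _ in filter_by_sequence(a, b)]

# in=b ref=a: the same four sequences, but carrying the names from b.
names_from_b = [name for name, _ in filter_by_sequence(b, a)]

print(names_from_a)
print(names_from_b)
```

In the first direction all four records of a.fasta survive with their names "1" to "4", which is why c.fasta looked identical to a.fasta. In the second direction the same four sequences come out under the longer identifiers from b.fasta.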



                  • #39
                    Yes, that does work. And it's more flexible and robust across different situations. Thank you for the great work. I've found a lot of fantastic scripts in bbmap!

                    Originally posted by Brian Bushnell
                    Yep, it keeps the sequences in a.fasta that match sequences in b.fasta, and retains the names. You could alternatively run "filterbysequence.sh in=b.fasta ref=a.fasta out=c.fasta" to keep the names from b.fasta.
