Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract subset of Fastq sequences based on a list of IDs

    Does anyone has a script to extract a subset of fastq sequences based on a list of IDs?
    similar to the "choose_FASTAs_from_list" script from scriptome but for fastq files

    thank you very much

  • #2
    Here's an SFF example copied from the Biopython Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.htm
    Code:
    from Bio import SeqIO
    input_file = "big_file.sff"
    id_file = "short_list.txt"
    output_file = "short_list.sff"
    wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
    print "Found %i unique identifiers in %s" % (len(wanted), id_file)
    records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
    count = SeqIO.write(records, output_file, "sff")
    print "Saved %i records from %s to %s" % (count, input_file, output_file)
    if count < len(wanted):
        print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
    Switch a few format names, and we get a FASTQ version:
    Code:
    from Bio import SeqIO
    input_file = "big_file.fastq"
    id_file = "short_list.txt"
    output_file = "short_list.fastq"
    wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
    print "Found %i unique identifiers in %s" % (len(wanted), id_file)
    records = (r for r in SeqIO.parse(input_file, "fastq") if r.id in wanted)
    count = SeqIO.write(records, output_file, "fastq")
    print "Saved %i records from %s to %s" % (count, input_file, output_file)
    if count < len(wanted):
        print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
    The IDs are taken from a text file, one per line. If your file is tab separated, the first entry of each line is used.

    Probably fast enough for a one off task, if this is something you'll be doing regularly there are likely quicker solutions.

    P.S. I've got Python scripts to do this and related tasks within Galaxy, checkout http://toolshed.g2.bx.psu.edu/ and in particular http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id
    Last edited by maubp; 05-06-2013, 02:18 AM. Reason: Adding Tool Shed URL

    Comment


    • #3
      You can do this with Biopieces (www.biopieces.org) like this:

      First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:

      Code:
      read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq
      Check out grab for details.


      Cheers,


      Martin

      Comment


      • #4
        Thank you for your quick reply Maubp
        I tried saving what you wrote into a file called ExtractFastq.py and ran it but I guess it was not a complete script since I got the following error. (sorry I am a begginer)

        $ python ExtractFastq.py
        Traceback (most recent call last):
        File "ExtractFastq.py", line 2, in <module>
        input_file = TrepoSinFilt.fastq
        NameError: name 'TrepoSinFilt' is not defined

        For the suggestion of Galaxy I am uploading the file, however it has 15Gb so I am not sure that with the speed of my internet I will be able to achieve it.

        Comment


        • #5
          Originally posted by maasha View Post
          You can do this with Biopieces (www.biopieces.org) like this:

          First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:

          Code:
          read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq
          Check out grab for details.


          Cheers,


          Martin
          Thank you for your quick reply Maasha, I am trying to install Biopieces, however I cannot run the test commands, it says
          zsh: command not found: bp_test

          I already checked that I have all the perl modules

          any suggestions?

          Comment


          • #6
            Follow the Installation instructions carefully. Also, it looks like you are running zsh. You have to run bash for Biopieces to work.

            Comment


            • #7
              Originally posted by pepperoni View Post
              Thank you for your quick reply Maubp
              I tried saving what you wrote into a file called ExtractFastq.py and ran it but I guess it was not a complete script since I got the following error. (sorry I am a begginer)

              $ python ExtractFastq.py
              Traceback (most recent call last):
              File "ExtractFastq.py", line 2, in <module>
              input_file = TrepoSinFilt.fastq
              NameError: name 'TrepoSinFilt' is not defined
              Probably you just need to put quotes round the filename as in the example, that's how you define a string in Python. i.e.
              Code:
              ...
              input_file = "TrepoSinFilt.fastq"
              ...
              You can also use single quotes rather than the double quote characters.

              Originally posted by pepperoni View Post
              For the suggestion of Galaxy I am uploading the file, however it has 15Gb so I am not sure that with the speed of my internet I will be able to achieve it.
              Uploading something that large to the Penn State Galaxy might be hard, also they might have started imposing user disk quotas now...

              [update] ... and the tool I'm talking about isn't on the Penn State Galaxy anyway, its an optional extra from the Tool Shed which can be added to a local Galaxy.
              Last edited by maubp; 10-13-2011, 10:56 AM.

              Comment


              • #8
                Actually, for 15Gb of data, you probably will care about the speed even for a one off - this should be about 5x faster as explained here (if you care about the Python side of things):


                Something like this:
                Code:
                from Bio.SeqIO.QualityIO import FastqGeneralIterator
                
                input_file = "big_file.fastq"
                id_file = "short_list.txt"
                output_file = "short_list.fastq"
                
                wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
                print "Found %i unique identifiers in %s" % (len(wanted), id_file)
                
                count = 0
                handle = open(output_file, "w")
                for title, seq, qual in FastqGeneralIterator(open(input_file)) :
                    if title.split(None,1)[0] in wanted:
                        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                        count += 1
                handle.close()
                
                print "Saved %i records from %s to %s" % (count, input_file, output_file)
                if count < len(wanted):
                    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
                A time comparison with BioPieces would be interesting if you can get both methods to work

                Please include installation and trouble shooting time as well

                Comment


                • #9
                  cdbfasta -Q and cdbyank are the best solutions I have found. Fast!

                  But also, would not the GNU version of grep work since it allows you to return a number of lines after the line that matches with the -A parameter?

                  --
                  Phillip

                  Comment


                  • #10
                    How would you supply a potentially very long list of IDs to grep?

                    Comment


                    • #11
                      Originally posted by maubp View Post
                      Actually, for 15Gb of data, you probably will care about the speed even for a one off - this should be about 5x faster as explained here (if you care about the Python side of things):


                      Something like this:
                      Code:
                      from Bio.SeqIO.QualityIO import FastqGeneralIterator
                      
                      input_file = "big_file.fastq"
                      id_file = "short_list.txt"
                      output_file = "short_list.fastq"
                      
                      wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
                      print "Found %i unique identifiers in %s" % (len(wanted), id_file)
                      
                      count = 0
                      handle = open(output_file, "w")
                      for title, seq, qual in FastqGeneralIterator(open(input_file)) :
                          if title.split(None,1)[0] in wanted:
                              handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                              count += 1
                      handle.close()
                      
                      print "Saved %i records from %s to %s" % (count, input_file, output_file)
                      if count < len(wanted):
                          print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
                      A time comparison with BioPieces would be interesting if you can get both methods to work

                      Please include installation and trouble shooting time as well


                      Now it has worked but I get this error:
                      alejandra@alejandra-desktop[DocumentosKNoCupieron] python ExtractFastq2.py
                      Found 3125101 unique identifiers in alignedTrepofa.lst
                      Traceback (most recent call last):
                      File "ExtractFastq2.py", line 12, in <module>
                      for title, seq, qual in FastqGeneralIterator(open(input_file)) :
                      File "/usr/lib/pymodules/python2.6/Bio/SeqIO/QualityIO.py", line 920, in FastqGeneralIterator
                      % (title_line, seq_len, len(quality_string)))
                      ValueError: Lengths of sequence and quality values differs for Corrida1:1117_10_107/1 (49 and 73).

                      I guess that it is because of the conversion from csfasta & qual to fastq. I used the solid2fastq from maq.
                      do you think that it may work to change my file to fastq sanger in galaxy before running the script?
                      On the other hand I already installed galaxy but in which directory shall I clone your tool from the galaxy tool shed?

                      thanks

                      Comment


                      • #12
                        Biopython expects FASTQ records where seq and qual are the same length. Galaxy accepts colour space FASTQ where the seq is one letter longer than the qual string (the nucleotide before the colour changes doesn't get a quality score for some reason). My Galaxy tool should cope with that situation by using the Galaxy library - but the examples above won't.

                        However, the error message says the sequence is length 49 and the quality is 73, which does not make sense at all. Can you show use some of the file - the bit including record Corrida1:1117_10_107/1 in particular?

                        My guess is something has gone wrong in the solid to fastq conversion.

                        Comment


                        • #13
                          Here is a bit of the file:

                          @Corrida1:1117_10_107/1
                          GNNNANNATNCGANGNNNTNTAANNAANTNGNNTCNGATNTNNNCNNAT
                          +
                          @-"-"-">-"-"5A-"3-"'-"-"-"?-"5.(-"-"6?-"9-"7-"-"7/-"-;6-"&-"-"-"5-"-"-=
                          @Corrida1:1117_10_146/1
                          ANNNGNNCANTATNGNNNGNCCANNCCNANCNNTGNATTNCNNNGNNTT
                          +
                          B-"-"-">-"-"==-"(1&-"?-"-"-"/-"%.%-"-",.-"&-"(-"-"+1-"%-0-")-"-"-"1-"-"+%
                          @Corrida1:1117_10_1017/1
                          GNNNCNNTANGCANTNNNGNACTNNACNGNANNGTNGTTNGNNNANNTT
                          +
                          B-"-"-"<-"-"21-"=9,-"'-"-"-">-")>9-"-"))-"5-".-"-").-"=+9-"+-"-"-"%-"-"('
                          @Corrida1:1117_11_136/1
                          ANNNTNNTANGATNGNNNANGTGNNTCNGNTGNGGNGGGNTNNNANNAT
                          +
                          7-"-"-"B-"-"?<-"<=A-">-"-"-"<-"3*'-"-"81-";-"&(-"7(-"3/)-")-"-"-",-"-"%9

                          Comment


                          • #14
                            Originally posted by pepperoni View Post
                            Here is a bit of the file:
                            Great, with the [ code ] tags (via the # button in the forum's advanced editor) we get this, without the face:
                            Code:
                            @Corrida1:1117_10_107/1
                            GNNNANNATNCGANGNNNTNTAANNAANTNGNNTCNGATNTNNNCNNAT
                            +
                            @-"-"-">-"-"5A-"3:(-"'-"-"-"?-"5.(-"-"6?-"9-"7-"-"7/-"-;6-"&-"-"-"5-"-"-=
                            @Corrida1:1117_10_146/1
                            ANNNGNNCANTATNGNNNGNCCANNCCNANCNNTGNATTNCNNNGNNTT
                            +
                            B-"-"-">-"-"==-"(1&-"?-"-"-"/-"%.%-"-",.-"&-"(-"-"+1-"%-0-")-"-"-"1-"-"+%
                            @Corrida1:1117_10_1017/1
                            GNNNCNNTANGCANTNNNGNACTNNACNGNANNGTNGTTNGNNNANNTT
                            +
                            B-"-"-"<-"-"21-"=9,-"'-"-"-">-")>9-"-"))-"5-".-"-").-"=+9-"+-"-"-"%-"-"('
                            @Corrida1:1117_11_136/1
                            ANNNTNNTANGATNGNNNANGTGNNTCNGNTGNGGNGGGNTNNNANNAT
                            +
                            7-"-"-"B-"-"?<-"<=A-">-"-"-"<-"3*'-"-"81-";-"&(-"7(-"3/)-")-"-"-",-"-"%9
                            Does that match what you see in the file, or has it got changed during posting to the forum?

                            It is very clear that the sequence and quality strings are not the same length. There is also a surprising number of minus signs and quotes in the quality strings. I thought at first there was an extra minus sign between every real quality character, but that doesn't quite explain it. Something is very wrong though.

                            Also your sequences look like nucleotides, could you confirm this is meant to be color-space data?

                            Could you show us the equivalent bits of the original color-space FASTA and QUAL files?
                            Last edited by maubp; 10-14-2011, 02:50 AM. Reason: typo

                            Comment


                            • #15
                              yes the file is exactly like that (without the happy face)
                              These are the original files

                              # Title: Corrida_16_01RMDSPFR004_1
                              >1117_10_107_F3
                              T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
                              >1117_10_146_F3
                              T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
                              >1117_10_1017_F3
                              T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
                              >1117_11_136_F3
                              T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03

                              # Title: Corrida_16_01RMDSPFR004_1
                              >1117_10_107_F3
                              23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28
                              >1117_10_146_F3
                              20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4
                              >1117_10_1017_F3
                              33 33 -1 -1 -1 27 -1 -1 17 16 -1 28 24 11 -1 6 -1 -1 -1 29 -1 8 29 24 -1 -1 8 8 -1 20 -1 13 -1 -1 8 13 -1 28 10 24 -1 10 -1 -1 -1 4 -1 -1 7 6
                              >1117_11_136_F3
                              16 22 -1 -1 -1 33 -1 -1 30 27 -1 27 28 32 -1 29 -1 -1 -1 27 -1 18 9 6 -1 -1 23 16 -1 26 -1 5 7 -1 22 7 -1 18 14 8 -1 8 -1 -1 -1 11 -1 -1 4 24

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              47 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X