Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hi guys,

    Thanks for this script! I used it successfully both on standard fastq as well as on casava format (with the appropriate changes in the script)

    But now I hit a wall again:

    I have been doing some error correction and this added some information to the header so afterwards the script does not work anymore.

    my data looks now like this:
    @PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
    NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
    +PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1
    BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
    TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
    +PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

    I tried to change the script in different ways, but it always results basically in the following error message:

    Traceback (most recent call last):
    File "/usit/titan/u1/chrishah/python/pair_up_reads.py", line 49, in <module>
    for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    File "/site/VERSIONS/python-2.6.2/lib/python2.6/site-packages/Bio/SeqIO/QualityIO.py", line 907, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
    ValueError: Sequence and quality captions differ.

    I could of course just remove the parts from the header that have been added by the errorcorrection tool and use the normal script...
    But I really think it should be easy to adjust the script so that it works for my data - I just dont know enough about Python...

    Can anyone give me a hint how to change the script? Any tips are highly appreciated!

    Much obliged!
    Christoph

    Comment


    • #17
      Originally posted by chrishah View Post
      Hi guys,

      Thanks for this script! I used it successfully both on standard fastq as well as on casava format (with the appropriate changes in the script)

      But now I hit a wall again

      I have been doing some error correction and this added some information to the header so afterwards the script does not work anymore.
      You should either edit BOTH the @ and the + lines like this,
      Code:
      @PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
      NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
      +PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
      BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
      @PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
      TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
      +PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
      or use a blank + line like this (which is recommended as the file is smaller):
      Code:
      @PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
      NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
      +
      BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
      @PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
      TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
      +
      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
      The Biopython parser is complaining about the inconsistency.

      Comment


      • #18
        Thanks for your rapid reply maubp!

        Have tried the second solution you suggested and it works fine with the standard version of the script!

        I will use this approach to proceed. Thanks again!

        take care,
        Christoph

        Comment


        • #19
          I have data generated from Illumina Hiseq 2000 which uses 1:N and 2:N as opposed to /1 and /2 as suffices.
          Ex:

          @D8CXHXP1:222:C0HVHACXX:2:1101:1836:2165 2:N:0:ACAGTG

          Do I just follow replacing those 2 lines of code with:

          f_suffix = ""
          r_suffix = ""

          Would it work then?? Sorry I don't understand

          Comment


          • #20
            Originally posted by vkartha View Post
            Would it work then?? Sorry I don't understand
            Yes, i.e.

            Code:
            from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later
            
            #######################################################
            #
            # Change the following settings to suit your needs
            #
            
            input_forward_filename = "SRR001666_1_rnd.fastq"
            input_reverse_filename = "SRR001666_2_rnd.fastq"
            
            output_pairs_filename = "out_interleaved_pairs.fastq"
            output_orphan_filename = "out_unpaired_orphans.fastq"
            
            f_suffix = ""
            r_suffix = ""
            #For older Illumina files use this instead:
            #f_suffix = "/1"
            #r_suffix = "/2"
            
            #######################################################
            
            if f_suffix:
                f_suffix_crop = -len(f_suffix)
                def f_name(title):
                    """Remove the suffix from a forward read name."""
                    name = title.split()[0]
                    assert name.endswith(f_suffix), name
                    return name[:f_suffix_crop]
            else:
                def f_name(title):
                    return title.split()[0]
            
            if r_suffix:
                r_suffix_crop = -len(r_suffix)
                def r_name(title):
                    """Remove the suffix from a reverse read name."""
                    name = title.split()[0]
                    assert name.endswith(r_suffix), name
                    return name[:r_suffix_crop]
            else:
                def r_name(title):
                    return title.split()[0]
            
            print "Scanning reverse file to build list of names..."    
            reverse_ids = set()
            paired_ids = set()
            for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                reverse_ids.add(r_name(title))
            
            print "Processing forward file..."
            forward_handle = open(output_paired_forward_filename, "w")
            orphan_handle = open(output_orphan_filename, "w")
            for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
                name = f_name(title)
                if name in reverse_ids:
                    #Paired
                    paired_ids.add(name)
                    reverse_ids.remove(name) #frees a little memory
                    forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                else:
                    #Orphan
                    orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            forward_handle.close()
            del reverse_ids #frees memory, although we won't need more now
            
            print "Processing reverse file..."
            reverse_handle = open(output_paired_reverse_filename, "w")
            for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                name = r_name(title)
                if name in paired_ids:
                    #Paired
                    reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                else:
                    #Orphan
                    orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            orphan_handle.close()
            reverse_handle.close()
            print "Done"
            A more rigorous script would verify the 1 or 2 as well.
            Last edited by maubp; 04-24-2012, 12:59 AM. Reason: Correcting messed up example - I'd copy & pasted the wrong bits from earlier posts.

            Comment


            • #21
              Hi everybody and thanks for supporting this forum. I'm running the latter script and I'm getting this error:

              Traceback (most recent call last):
              File "clean-fastQ.py", line 46, in <module>
              for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
              NameError: name 'FastqGeneralIterator' is not defined

              Any ideas for someone that is pretty new in bioinformatics??

              Regards

              Originally posted by maubp View Post
              Yes, i.e.

              Code:
              from Bio import SeqIO #Biopython 1.54 or later needed
              
              #######################################################
              #
              # Change the following settings to suit your needs
              #
              
              input_forward_filename = "SRR001666_1_rnd.fastq"
              input_reverse_filename = "SRR001666_2_rnd.fastq"
              
              output_pairs_filename = "out_interleaved_pairs.fastq"
              output_orphan_filename = "out_unpaired_orphans.fastq"
              
              f_suffix = ""
              r_suffix = ""
              #For older Illumina files use this instead:
              #f_suffix = "/1"
              #r_suffix = "/2"
              
              #######################################################
              
              if f_suffix:
                  f_suffix_crop = -len(f_suffix)
                  def f_name(name):
                      """Remove the suffix from a forward read name."""
                      assert name.endswith(f_suffix), name
                      return name[:f_suffix_crop]
              else:
                  #No suffix, don't need a function to fix the name
                  f_name = None
              
              if r_suffix:
                  r_suffix_crop = -len(r_suffix)
                  def r_name(name):
                      """Remove the suffix from a reverse read name."""
                      assert name.endswith(r_suffix), name
                      return name[:r_suffix_crop]
              else:
                  #No suffix, don't need a function to fix the name
                  r_name = None
              
              print "Scaning reverse file to build list of names..."    
              reverse_ids = set()
              paired_ids = set()
              for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                  reverse_ids.add(r_name(title))
              
              print "Processing forward file..."
              forward_handle = open(output_paired_forward_filename, "w")
              orphan_handle = open(output_orphan_filename, "w")
              for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
                  name = f_name(title)
                  if name in reverse_ids:
                      #Paired
                      paired_ids.add(name)
                      reverse_ids.remove(name) #frees a little memory
                      forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                  else:
                      #Orphan
                      orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
              forward_handle.close()
              del reverse_ids #frees memory, although we won't need more now
              
              print "Processing reverse file..."
              reverse_handle = open(output_paired_reverse_filename, "w")
              for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                  name = r_name(title)
                  if name in paired_ids:
                      #Paired
                      reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                  else:
                      #Orphan
                      orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
              orphan_handle.close()
              reverse_handle.close()
              print "Done"
              A more rigorous script would verify the 1 or 2 as well.

              Comment


              • #22
                Originally posted by borjito View Post
                Hi everybody and thanks for supporting this forum. I'm running the latter script and I'm getting this error:

                Traceback (most recent call last):
                File "clean-fastQ.py", line 46, in <module>
                for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                NameError: name 'FastqGeneralIterator' is not defined

                Any ideas for someone that is pretty new in bioinformatics??

                Regards
                That's a typical Python error (exception), it is saying 'FastqGeneralIterator' has not been defined. In this case I think you are missing the 'import' line which would have loaded FastqGeneralIterator from Biopython (v1.51 or later). i.e.

                from Bio.SeqIO.QualityIO import FastqGeneralIterator

                Update: It looks like when I wrote that re-cap post I must have slipped up.
                Last edited by maubp; 04-23-2012, 08:22 AM.

                Comment


                • #23
                  Many thanks! So now the exception disappeared, but I found this new error:

                  $ python clean-fastQ.py
                  Scaning reverse file to build list of names...
                  Traceback (most recent call last):
                  File "clean-fastQ.py", line 46, in <module>
                  reverse_ids.add(r_name(title))
                  TypeError: 'NoneType' object is not callable

                  I'm wondering whether this error comes from the file format; this is an example of the files I'm trying to "clean":

                  @HTKZQN1:399:C0JJVACXX:5:1101:1680:2187 2:N:5122:AGTCAA
                  ACTAGTATGGCCCGGGGGATCCTAGAGACCATTCGCGATTCCATGAGACTCCAAGGGTTCTGCACAACTTATGCACCTCTATTAGATCATTGTGTTCTACG
                  +
                  CCCDFDFFHGHHHIJJJJDHIJJIJJIIIIIIIJJJHDDEBCEECCCDDDCDDDDDDAADDDDCCDDBDDDDAAACCCDCACDDCCDDDCDCDDCCDDCCD

                  Many thanks in advance

                  Borja

                  Comment


                  • #24
                    Originally posted by borjito View Post
                    Many thanks! So now the exception disappeared, ...
                    Yep - I messed up that recap post. Updating it now... sorry.

                    Comment


                    • #25
                      Thanks again, I've also changed the

                      output_pairs_filename = "out_interleaved_pairs.fastq"

                      at the beggining by

                      output_paired_forward_filename = "IPLA2_1fc.fq"
                      output_paired_reverse_filename = "IPLA2_2fc.fq"

                      o create 2 files, as defined in the script, and it works perfectly! I've used the two outputs as input for SOAPdenovo and got a notable decrease in computation time.

                      Many thanks!!

                      Comment


                      • #26
                        Just wondering if anyone has a script to do the opposite to the scripts in this thread, ie. split an already interleaved forward and reverse file into separate forward and reverse files (for Casava 1.8 and beyond read naming)?

                        Comment


                        • #27
                          Looks like you have some answers here, but I've used this and it works wonderfully well: https://github.com/ucdavis-bioinformatics/sickle

                          It won't make the interleaved, instead it retains the split left/right files with a new single end file. But you can easily use other scripts to create the interleaved file if you must.

                          Comment


                          • #28
                            Thanks for quick reply. Actually, I'm looking for something even simpler, where I just have one interleaved fastq file that I want to split into separate forward and reverse. Probably should have posted in a separate thread since I'm not so worried about trimming in this case.

                            I'm sure it is some very simple python that I could manage in the end, such a script is probably already out there.

                            Comment


                            • #29
                              Hi Matt,

                              Just googling around I see this post with a script that claims to do such a thing, I've never had to do this type of thing however: http://bioinfomemaid.blogspot.com/

                              Comment


                              • #30
                                Ah, that looks promising, will give it a go. Don't know why my google search didn't hit upon that one!

                                Thanks!

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                31 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X