  • For MAQ: Is there a tool to convert Sanger-format FASTQ files to Illumina-format FASTQ?

    Hello everyone,

    I am new to next-gen sequencing and this forum. Hope someone can help me out here.

    To practice and test alignment software, I downloaded a short-read dataset of a yeast genome and tried to convert the Sanger-format FASTQ data to MAQ's BFQ format (I didn't know that SRA provides Sanger-format FASTQ while MAQ prefers the other FASTQ format).

    The command line I used was:
    Code:
    maq fastq2bfq SRR002051.fastq SRR002051.bfq
    Part of the warnings shown on the screen:
    Code:
    [seq_read_fastq] Inconsistent sequence name: ;E)$$$%%%%$%$""&"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: 32-)"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: *IDI*II%A;1+3&"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: $$,$"#&&%4&+$("""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: 6&%*I)''%11#"+-"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: 43&"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: (I#$,)B:E/(&"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: I5.=;&#!"-"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: """""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: (%%+%$/"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: $&/#2#&%!%"!"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: /%%!$#%*#"&"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: +6+/&%+&%$"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: +F)'$5*&+9%""+%"""""". Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: %%'"!"""""". Continue anyway.
    I checked the BFQ file by converting it back to Sanger-format FASTQ. It appears that fastq2bfq cannot handle the '@' symbol in the quality score lines of Sanger-format FASTQ files.

    I am wondering if anyone has already written a Sanger-format to Illumina-format FASTQ converter; it would be really helpful to me.

    Thanks.

  • #2
    You have already got Sanger-style FASTQ files from the NCBI SRA, and MAQ likes standard Sanger FASTQ files. You would only need to convert if you had started with Solexa- or Illumina-encoded FASTQ files.

    Maybe the problem is something else - could you post the first 20 lines or so of the FASTQ file in the forum - use the [ code ] data [ /code ] tags to make it display nicely.
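
For reference, a minimal sketch of what such a conversion would involve, assuming the input uses the Illumina 1.3+ encoding (Phred scores stored as ASCII value minus 64) and the target is Sanger encoding (ASCII value minus 33). The function name is hypothetical and not part of MAQ or any library. Older Solexa files store a different kind of score, so this simple offset shift would not be exact for them.

```python
def illumina64_to_sanger(qual):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger).

    Assumes Illumina 1.3+ style input; Solexa scores are not handled.
    """
    out = []
    for ch in qual:
        q = ord(ch) - 64  # recover the Phred score
        if q < 0:
            raise ValueError("character %r is below the Phred+64 range" % ch)
        out.append(chr(q + 33))  # store it with the Sanger offset
    return "".join(out)


# 'h' is ASCII 104, i.e. Q40, which Sanger encoding writes as 'I' (ASCII 73).
print(illumina64_to_sanger("hhhh"))
```

The SRA file in this thread is already Sanger-encoded, so it would not need this step.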



    • #3
      It is not the '@' symbol itself that is not allowed; the problem is an '@' that follows an illegal space in the description. Unfortunately, many FASTQ files are not properly formatted and contain spaces in the sequence name, which causes MAQ to mess up. Clean up the sequence names and the tool will work.
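
One way to clean up the names, as a rough sketch: keep only the first whitespace-delimited token on the '@' and '+' lines. This assumes every record is exactly four lines (no wrapped sequence or quality lines); the function name is made up for illustration, and I have not checked the result against every MAQ version.

```python
def trim_fastq_names(lines):
    """Yield FASTQ lines with '@' and '+' lines cut at the first whitespace.

    Assumes strict four-line records: name, sequence, '+' line, quality.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Lines 0 and 2 of each record carry the name; leave the rest alone.
        if i % 4 in (0, 2) and line.strip():
            line = line.split(None, 1)[0]
        yield line


record = [
    "@SRR002051.1 :8:1:325:773 length=33\n",
    "AAAGAACATTAAAGCTATATTATAAGCAAAGAT\n",
    "+SRR002051.1 :8:1:325:773 length=33\n",
    "IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)-\n",
]
for cleaned in trim_fastq_names(record):
    print(cleaned)
```

Because it works line by line, the same generator can be fed an open file handle to process a large file without loading it into memory.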



      • #4
        Originally posted by maubp
        You have already got Sanger-style FASTQ files from the NCBI SRA, and MAQ likes standard Sanger FASTQ files. You would only need to convert if you had started with Solexa- or Illumina-encoded FASTQ files.

        Maybe the problem is something else - could you post the first 20 lines or so of the FASTQ file in the forum - use the [ code ] data [ /code ] tags to make it display nicely.
        Thanks for the replies.
        Here are the first 20 lines of the FASTQ file I downloaded from SRA:
        Code:
        @SRR002051.1 :8:1:325:773 length=33
        AAAGAACATTAAAGCTATATTATAAGCAAAGAT
        +SRR002051.1 :8:1:325:773 length=33
        IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)-
        @SRR002051.2 :8:1:409:432 length=33
        AAGTTATGAAATTGTAATTCCAATATCGTAAGC
        +SRR002051.2 :8:1:409:432 length=33
        IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
        @SRR002051.3 :8:1:488:490 length=33
        AATTTCTTACCATATTAGACAAGGCACTATCTT
        +SRR002051.3 :8:1:488:490 length=33
        IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
        @SRR002051.4 :8:1:899:554 length=33
        AGATTTCTAATATGGTTAAGAAGCGAACTTTTT
        +SRR002051.4 :8:1:899:554 length=33
        IIIIIIIIIIIIIIIIIII?IIIIII<IIIIII
        @SRR002051.5 :8:1:464:463 length=33
        AAAGCAGCAGCACGTAGTTCTTCATCCTTCTTC
        +SRR002051.5 :8:1:464:463 length=33
        IIIIIIIIIIIIIIIIIIIIIIIFIIIIII%.I
        These are the first 20 lines of the FASTQ file that I converted back from the BFQ file:
        Code:
        @SRR002051.1
        AAAGAACATTAAAGCTATATTATAAGCAAAGAT
        +
        :8:1:325:773``````=33IIIIIIIIIIII
        @I$)-
        NNNNNNNGTNAAGTTATGAAATTGTAATTCCAATATCGTAAGC
        +
        !!!!!!!5:!73``````=33IIIIIIIIIIII""""""""""
        @SRR002051.3
        AATTTCTTACCATATTAGACAAGGCACTATCTT
        +
        :8:1:488:490``````=33IIIIIIIIIIII
        @SRR002051.4
        AGATTTCTAATATGGTTAAGAAGCGAACTTTTT
        +
        :8:1:899:554``````=33IIIIIIIIIIII
        @SRR002051.5
        AAAGCAGCAGCACGTAGTTCTTCATCCTTCTTC
        +
        :8:1:464:463``````=33IIIIIIIIIIII
        Clearly, short read SRR002051.2 has both the wrong sequence and incorrect quality scores. I checked several more reads that have '@' in their quality scores, and they have the same problem.
        Last edited by byb121; 12-22-2009, 03:38 AM.



        • #5
          Looking at that, I think aaronh is right - MAQ doesn't like the descriptions after the identifiers. I would file a bug on MAQ.

          In the short term, you could convert this and remove the descriptions using another tool.

          e.g. In Biopython 1.51 or later using the SeqIO interface:
          Code:
          from Bio import SeqIO
          
          def remove_descr(records):
              """Iterate over SeqRecord objects clearing their description."""
              for rec in records:
                  rec.description = ""
                  yield rec
          
          # Stream the records through, so the whole file is never held in memory.
          with open("byb121_sra.fastq") as in_handle, open("byb121_maq.fastq", "w") as out_handle:
              records = remove_descr(SeqIO.parse(in_handle, "fastq"))
              count = SeqIO.write(records, out_handle, "fastq")
          
          print("Converted %i records" % count)



          • #6
            Thanks a lot. Since it's a short-term practice anyway, I will just get rid of those spaces, or perhaps everything after the space. It is always good to know that I didn't do anything wrong.

            If MAQ can fix the problem it'll be really really great.



            • #7
              If you leave the "+" line (the third line of each record) empty, MAQ will parse the sequence.

              Before:
              Code:
              $cat  test.fastq
              @SRR228083.sra.1HWI-EAS158_0001:5:1:1089:19990length=36
              CACTTTGCGTAACGTACACTGGGNTCGCTGAANTAG
              +SRR228083.sra.1 HWI-EAS158_0001:5:1:1089:19990 length=36
              BBABB@B@<4:7:>:>2;3>;>?#@###########
              @SRR228083.sra.2HWI-EAS158_0001:5:1:1089:13103length=36
              GCGCGGTGGTCCCACCTGACCCCNTGCCGAACNCAG
              +SRR228083.sra.2 HWI-EAS158_0001:5:1:1089:13103 length=36
              CCCCCC@CA@C@CCC=BCAB>7@#@>?-@#######
              
              $maq  fastq2bfq  test.fastq  test.bfq                             
              [seq_read_fastq] Inconsistent sequence name: B@<4:7:>:>2;3>;>?#@###########. Continue anyway.
              [seq_read_fastq] Inconsistent sequence name: CA@C@CCC=BCAB>7@#@>?-@#######. Continue anyway.
              -- finish writing file 'test.bfq'
              -- 2 sequences were loaded.
              After:
              Code:
              $cat test.new.fastq
              @SRR228083.sra.1HWI-EAS158_0001:5:1:1089:19990length=36
              CACTTTGCGTAACGTACACTGGGNTCGCTGAANTAG
              +
              BBABB@B@<4:7:>:>2;3>;>?#@###########
              @SRR228083.sra.2HWI-EAS158_0001:5:1:1089:13103length=36
              GCGCGGTGGTCCCACCTGACCCCNTGCCGAACNCAG
              +
              CCCCCC@CA@C@CCC=BCAB>7@#@>?-@#######
              
              $maq  fastq2bfq test.new.fastq  test.new.bfq
              -- finish writing file 'test.bfq'
              -- 2 sequences were loaded.
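
The same edit could be scripted rather than done by hand. A small sketch, assuming strict four-line records (no wrapped sequence or quality lines); the function name is only for illustration:

```python
def blank_plus_lines(lines):
    """Yield FASTQ lines with the third line of every record reduced to '+'.

    Assumes each record is exactly four lines.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Index 2 within each group of four is the '+' separator line.
        yield "+" if i % 4 == 2 else line


record = [
    "@SRR228083.sra.1HWI-EAS158_0001:5:1:1089:19990length=36\n",
    "CACTTTGCGTAACGTACACTGGGNTCGCTGAANTAG\n",
    "+SRR228083.sra.1 HWI-EAS158_0001:5:1:1089:19990 length=36\n",
    "BBABB@B@<4:7:>:>2;3>;>?#@###########\n",
]
for cleaned in blank_plus_lines(record):
    print(cleaned)
```

Only the '+' line is touched here; the '@' line and the quality string pass through unchanged.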
