Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • converting consensus fastq to fasta

    I was trying to convert fastq consensus sequneces (generated by the pileup2fq utility of samtools) to a multi-fasta file with the MAQ's fq_all2std.pl fq2fa script but it didn't work. The result is a mix of 60bp sequence and quality lines with the > headers.

    Is there quick way to extract only the sequences from the fastq file?

    Thank you.

  • #2
    Originally posted by zlu View Post
    I was trying to convert fastq consensus sequneces (generated by the pileup2fq utility of samtools) to a multi-fasta file with the MAQ's fq_all2std.pl fq2fa script but it didn't work. The result is a mix of 60bp sequence and quality lines with the > headers.

    Is there quick way to extract only the sequences from the fastq file?

    Thank you.
    I personally use Biopython, e.g.



    Code:
    from Bio import SeqIO
    count = SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")
    print "Converted %i reads" % count
    You might also consider BioPerl or EMBOSS seqret, etc.

    However, the problem could be a broken FASTQ file. Could you show use the start of the file (say two reads)?

    Comment


    • #3
      Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

      @chr1
      nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
      GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
      ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
      GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
      AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
      TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
      TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
      +
      !!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
      ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
      HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
      QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
      olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
      EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
      NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

      And using the method described above, this is what I got:

      >chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
      >`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
      Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
      >QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
      rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
      >QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
      louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
      >KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
      WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
      >QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
      QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
      >QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
      QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
      >LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
      ~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
      >QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
      BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
      >NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
      WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT

      Comment


      • #4
        Are the sequences and quality lines in the original FASTQ line wrapped? This is valid but not recommended as some tools (perhaps including MAQ) don't understand it.

        The tools I suggested can all cope with line wrapped FASTQ. See also: http://dx.doi.org/10.1093/nar/gkp1137

        (Or it may be the forum making it look line wrapped and adding spaces when it isn't really - trying using the [ code ] text [ /code ] tags to stop this)
        Last edited by maubp; 12-18-2009, 04:37 AM.

        Comment


        • #5
          yes, the lines are indedd wrapped. Thanks for the reminder.

          However, trying seqret of EMBOSS 6.1, I got the following error:

          [SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
          Reads and writes (returns) sequences
          Input (gapped) sequence(s): CNS.fq
          Error: Unable to read sequence CNS.fq'

          Comment


          • #6
            Originally posted by zlu View Post
            yes, the lines are indedd wrapped. Thanks for the reminder.
            That probably does explain MAQ's failure.
            Originally posted by zlu View Post
            However, trying seqret of EMBOSS 6.1,
            Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes. See:
            ftp://emboss.open-bio.org/pub/EMBOSS/fixes/README.fixes

            Originally posted by zlu View Post
            I got the following error:

            [SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
            Reads and writes (returns) sequences
            Input (gapped) sequence(s): CNS.fq
            Error: Unable to read sequence CNS.fq'
            That is odd. Note you can also try giving the filename as part of the command line, e.g.

            $ seqret -sformat fastq-sanger -sequence CNS.fq -osformat fasta -outseq CNS.fasta

            How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked.

            Comment


            • #7
              Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.

              Comment


              • #8
                Originally posted by maubp View Post
                Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes.
                I'll try to patch it and see how it goes.

                Originally posted by maubp View Post
                How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked
                The compressed file is more than 1.3GB.

                Thanks for your help. I'll try to process the fastq with a script.

                Comment


                • #9
                  Using the patched emboss, I can now convert the line wrapped fastq files. Thank you.

                  Comment


                  • #10
                    Originally posted by lh3 View Post
                    Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.
                    Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?

                    Comment


                    • #11
                      Originally posted by maubp View Post
                      Originally posted by lh3 View Post
                      Most of fastq parsers do not work with multi-line fastq files.
                      Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?
                      From the Wikipedia article on FASTQ (http://en.wikipedia.org/wiki/FASTQ_format):

                      The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).

                      Comment


                      • #12
                        Just because the consensus is that line wrapping in FASTQ is/was a bad idea, doesn't mean it can't be reliably parsed. It does make the code a little more complicated I admit - but as this thread shows, it would be useful to some if MAQ could understand line-wrapped FASTQ.

                        Comment


                        • #13
                          The version in MAQ svn, which is never released, parses multi-line fastq and accepts gzipped input, as it uses the same parser as bwa. Nonetheless, as maq only works with reads no longer than 127bp, not supporting multi-line fastq is not a major problem. It is more important for bwa to parse multi-line fastq as it works with long reads.

                          Comment


                          • #14
                            Originally posted by zlu View Post
                            Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

                            @chr1
                            nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
                            GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
                            ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
                            GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
                            AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
                            TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
                            TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
                            +
                            !!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
                            ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
                            HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
                            QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
                            olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
                            EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
                            NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

                            And using the method described above, this is what I got:

                            >chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
                            >`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
                            Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
                            >QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
                            rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
                            >QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
                            louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
                            >KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
                            WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
                            >QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
                            QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
                            >QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
                            QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
                            >LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
                            ~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
                            >QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
                            BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
                            >NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
                            WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT
                            awk '/@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\+//" > chr1.fasta

                            here "chr1" is chromosome

                            Comment


                            • #15
                              My question is related, but not quite the same, which is: I need to get a sub-string of the sequence and the corresponding quality score for each of the entries, the file format untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

                              The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.
                              Code:
                              use Bio::SeqIO; use Bio::Seq::Quality;
                              
                              $seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');
                              
                              my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');
                              
                              while((my $line = $seqio->next_seq() ) { 
                              # keep the id 
                              # substring the sequence (from 11 to 66) 
                              # substring the quality score (from 11 to 66) $out_fastq->write_seq($line); }
                              Or any tools there to do my job?

                              Thanks a lot!

                              Yifang
                              Last edited by yifangt; 08-13-2011, 04:44 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X