
  • Convert fastq from NCBI SRA to fasta and qual?

    Hi all,

    I have downloaded some 454 and Illumina data from the NCBI SRA that is in .fastq format.

    Example 454 data:
    Code:
    @SRR000072.1 ERBRDQF01EGP9U length=67
    TAATGTGCTTTTCTATAGACAGTCCATTTTCAGGGATATTTTCCAAACTGTCTGGACTGTCTATAGA
    +SRR000072.1 ERBRDQF01EGP9U length=67
    <?:<<<<;>=2"<<<<<<<:<;<;5<??7+<<<:';<<>=3#=7<:(<;<<<<@;;<;<<<:;<<<<
    I can't figure out how to convert this data to .fasta and .qual. I checked out this page: http://www.bugaco.com/converter/biol...nces/index.php but I don't know which type of fastq data it is. At any rate, none of the fastq to fasta converters worked. I also tried the fastq_to_fasta program from this site http://hannonlab.cshl.edu/fastx_tool...mmandline.html but it didn't work either. Any assistance in finding a linux-based conversion tool would be greatly appreciated!

    Thank you!
    Kevin

  • #2
    For an explanation of the differences between FASTQ variants, take a look at this thread: http://seqanswers.com/forums/showthread.php?t=3271

    I would recommend starting with the fq_all2std.pl script in MAQ.

    Personally, I prefer to use Python for this kind of task. For more info, see this short article:
    O|B|F News: Working with FASTQ files in Biopython when speed matters



    • #3
      Any FASTQ file from the NCBI SRA seems to already be in the standard Sanger FASTQ format (even if originally from a Solexa/Illumina machine it has been converted). See:
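As a rough check on the encoding, one can look at the lowest ASCII code used in the quality lines. This is only a sketch: the function name `guess_quality_offset` is made up for illustration. It relies on Sanger FASTQ using an ASCII offset of 33, while the older Solexa/Illumina variants use 64 and so never contain a character below `;` (ASCII 59). The quality string is the one from the 454 example in the first post.

```python
def guess_quality_offset(quality_lines):
    """Return 33 if the quality strings can only be Sanger (Phred+33).

    Returns None when the evidence is ambiguous, i.e. every character
    is at or above ';' (ASCII 59).
    """
    lowest = min(ord(char) for line in quality_lines for char in line)
    # Characters below ';' cannot occur in the offset-64 variants.
    if lowest < ord(";"):
        return 33
    return None


# Quality line copied from the SRR000072.1 record in the first post.
EXAMPLE_QUALITY = (
    "<?:<<<<;>=2\"<<<<<<<:<;<;5<??7+<<<:';<<>=3#=7<:(<;<<<<@;;<;<<<:;<<<<"
)

print(guess_quality_offset([EXAMPLE_QUALITY]))
```

A `None` result does not prove that a file uses offset 64, since high-quality Sanger reads can also stay above the cutoff.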


      As Andreas has suggested, you could use Biopython to do FASTQ -> QUAL and FASTQ -> FASTA. Each conversion can be done with a trivial two-line script using Biopython 1.52 or later.
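A minimal sketch of the two conversions, assuming Biopython 1.52 or later is installed. The file names and the tiny four-base record are placeholders invented so the example runs on its own; they are not from the thread. `Bio.SeqIO.convert` takes the input file and format first, then the output file and format, and returns the number of records converted.

```python
from Bio import SeqIO

# Hypothetical one-record Sanger FASTQ file, written here only so the
# example is self-contained.
with open("example.fastq", "w") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")

# FASTQ -> FASTA (input file and format, then output file and format)
n_fasta = SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")

# FASTQ -> QUAL
n_qual = SeqIO.convert("example.fastq", "fastq", "example.qual", "qual")

print("Converted %i records" % n_fasta)
```

For real data, replace `example.fastq` with the downloaded SRA file and drop the lines that write the placeholder record.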



      • #4
        Originally posted by kmkocot
        I checked out this page: http://www.bugaco.com/converter/biol...nces/index.php but ... none of the fastq to fasta converters worked.
        I just commented on the other thread. That looks like a bug in the website's handling of some characters in FASTQ quality strings.



        • #5
          Thanks all! I will look into the biopython scripts.



          • #6
            I have BioPython 1.49 and I found the utilities I need but I'm having trouble. I'm using Ubuntu Linux. I started python from a terminal while in the same directory as my desired input fastq file. I tried the tests recommended in the manual but I'm not sure if they are working correctly. import Bio returned nothing (normal, right?) but print Bio.__version__ returned a syntax error (as did print Bio.149, print Bio.1.49, etc). It wasn't clear to me what I should actually type. Is my version too old?

            Here's the code I used and messages printed to standard output:
            Code:
            >>> from Bio import SeqIO
            >>> SeqIO.convert("output.fasta", "fasta", "Biomphalaria_glabrata_454.fastq", "fastq")
            Traceback (most recent call last):
              File "<stdin>", line 1, in <module>
            AttributeError: 'module' object has no attribute 'convert'
            Is my version of BioPython too old? I couldn't for the life of me figure out how to install it without synaptic. It kept saying it couldn't find Python.h.

            Thanks,
            Kevin



            • #7
              Yes, Biopython 1.49 is too old. You need at least Biopython 1.51 for FASTQ support, and at least Biopython 1.52 for the Bio.SeqIO.convert function:
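The version thresholds above can be checked from Python. The helper below is a hypothetical sketch (the name `biopython_features` is not part of Biopython): it parses a version string such as "1.49" and reports whether it meets the 1.51 and 1.52 minimums mentioned in this post. The last few lines try the installed copy, if there is one.

```python
import re


def biopython_features(version):
    """Report which features from this thread a version string supports."""
    # Keep only the leading "major.minor" digits so suffixed strings parse too.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("Unrecognised version string: %r" % version)
    major_minor = (int(match.group(1)), int(match.group(2)))
    return {
        "fastq": major_minor >= (1, 51),    # FASTQ support in Bio.SeqIO
        "convert": major_minor >= (1, 52),  # Bio.SeqIO.convert function
    }


try:
    import Bio
    print(Bio.__version__, biopython_features(Bio.__version__))
except (ImportError, AttributeError):
    print("Biopython is not installed, or does not report a version")
```

Note that in Python 3 `print` is a function, so the version is shown with `print(Bio.__version__)` after `import Bio`.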


              Which version of Ubuntu are you using? I'm guessing jaunty from this listing:


              I install Biopython from source on Ubuntu (I currently use Karmic, and before that I used Dapper, which is really old now).

              You need to install the build dependencies, for example the python-dev package, which includes header files such as Python.h that you are currently missing. As described on http://biopython.org/wiki/Download#Ubuntu_or_Debian try this first:

              sudo apt-get build-dep python-biopython

              P.S. Once you have this installed, the Bio.SeqIO.convert function takes the input file and format then the output file and format. Your attempted example seems to have this the wrong way round.
              Last edited by maubp; 01-23-2010, 06:43 AM.



              • #8
                Here is a script that you can place in your bin/ directory:

                Code:
                #!/usr/bin/env python
                
                """
                Convert single FASTAQ files to FASTA + QUAL file pairs
                http://seqanswers.com/forums/showthread.php?t=3730
                
                You can use this script from the shell like this::
                $ ./fastaq_to_fasta reads.fastq reads.fna reads.qual
                """
                
                # The libraries we need #
                import sys, os
                from Bio import SeqIO
                # Get the shell arguments #
                fq_path = sys.argv[1]
                fa_path = sys.argv[2]
                qa_path = sys.argv[3]
                # Check that the path is valid #
                if not os.path.exists(fq_path): raise Exception("No file at %s." % fa_path)
                # Do it #
                SeqIO.convert(fq_path, "fastq", qa_path, "qual")
                SeqIO.convert(fq_path, "fastq", fa_path, "fasta")

