  • jddavis
    Junior Member
    • May 2013
    • 7

    converting s_*_sequence.txt file to Fasta

    Dear All:

    This is a simple question, but I'm new to bioinformatics and would love to have some help converting file types so that google and I are not "going it alone"...

    Here is the file header/format (from a very old sequencing file):

    @BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
    TGGTGCAAAATATGAAGTCAATAAGATTAAAATAAA
    +BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
    ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZU

    All are saved as s_*_sequence.txt, where * is some number. What I'd like is to convert them to *.fasta format so I can complete the "pipeline" analysis I have already set up in Linux. The software I intend to use is (in order):

    FastQC
    sickle
    BWA
    Picard
    GATK

    annotate with DAVID or ANNOVAR...

    Please help -- you will earn lots of good "juju"...
  • kbradnam
    Member
    • May 2011
    • 54

    #2
    Your text files are in fact in FASTQ format. If you look for FASTQ to FASTA converters, you will probably find a lot.

    • SNPsaurus
      Registered Vendor
      • May 2013
      • 525

      #3
      There are many unix one-line solutions to this, but I often use:
      sed -n '1~4s/^@/>/p;2~4p' FILENAME.fastq > FILENAME.fasta
      Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
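The `1~4` and `2~4` addresses in the sed one-liner above are a GNU sed extension, so the command may not work with other sed implementations (BSD/macOS sed, for example). A portable sketch of the same conversion in awk is below. The function name `fastq_to_fasta` is made up for illustration, and the code assumes strict four-line FASTQ records (no wrapped sequence lines), as in the example read from the first post.

```shell
# Hypothetical helper: convert a 4-line-per-record FASTQ file to FASTA on stdout.
# Assumes every record is exactly four lines: header, sequence, '+' line, qualities.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: replace leading @ with >
         NR % 4 == 2 { print }                     # sequence line, unchanged
        ' "$1"
}

# Usage (file names are placeholders):
#   fastq_to_fasta s_1_sequence.txt > s_1_sequence.fasta
```

Like the sed version, this throws away the quality line, so it only makes sense for tools that really need FASTA.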

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Don't convert your files. Your files are in FASTQ format, and FASTQ is the de facto standard for Next Generation Sequence data. All of the tools you have listed (at least the first group there) expect the input to be in FASTQ format. See the Wikipedia page for FASTQ.

        If any of the software complains that the files don't have the proper extension, just change .txt to .fq (or .fastq). As your files are old, they clearly are using Illumina's old (pre-v1.8) Phred+64 quality encoding (again, see the Wikipedia article). You will either have to tell each tool that you are using phred64 FASTQ files or convert them to phred33 first.
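Both steps (the rename and the phred64 to phred33 conversion) can be sketched in plain shell and awk. The function names `fastq_name` and `phred64_to_phred33` are hypothetical. The conversion assumes strict four-line records and quality characters in the Phred+64 range (ASCII 64 to 126), and simply shifts every quality character down by 31. Many aligners and trimmers have their own switch for Phred+64 input, so check each tool's documentation before converting anything.

```shell
# Hypothetical helper: print the .fastq name for an s_*_sequence.txt file.
fastq_name() {
    printf '%s\n' "${1%.txt}.fastq"
}

# Hypothetical helper: rewrite Phred+64 qualities as Phred+33 (stdin to stdout).
# Assumes strict 4-line FASTQ records and quality characters in ASCII 64-126.
phred64_to_phred33() {
    awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 0 {
             # quality line: shift every character down by 31 (64 - 33)
             out = ""
             for (j = 1; j <= length($0); j++)
                 out = out sprintf("%c", ord[substr($0, j, 1)] - 31)
             print out
             next
         }
         { print }   # header, sequence and + lines pass through unchanged
        '
}

# Usage (writes a converted copy next to each original file):
#   for f in s_*_sequence.txt; do
#       phred64_to_phred33 < "$f" > "$(fastq_name "$f")"
#   done
```

With the example read from the first post, a quality of `Z` (Q26 in Phred+64) becomes `;` and `U` (Q21) becomes `6`.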

        • chadn737
          Senior Member
          • Jan 2009
          • 392

          #5
          I'm not sure why you want to convert these to fasta. The files are already in fastq format, which has the quality scores that are absent in fasta. So you lose valuable information in doing the conversion. Furthermore, all those tools take fastq as input. In particular, if you are doing any sort of quality score trimming, then you have to have the quality scores.

          • Jeremy
            Senior Member
            • Nov 2009
            • 190

            #6
            Just change the file names from .txt to .fastq and they can already be used in all of the programs you listed and almost all other programs designed for next gen sequence data. No conversion necessary.

            • jddavis
              Junior Member
              • May 2013
              • 7

              #7
              thank you all! The command-line suggestion was particularly helpful as I am automating the pipeline and this is my first attempt at scripting....glad it is an easy solution.

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                If you are automating this process using a pipeline then keep in mind kmcarr's note about FASTQ quality encoding (post #4). You will need to account for data in old (Illumina)/new (Sanger) format appropriately.
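One way a pipeline could account for the two encodings is to guess the offset from the quality characters before running anything else. The sketch below is a heuristic only, and `guess_phred_offset` is a made-up name: any quality character below ASCII 59 can only occur in Phred+33 data, so the function reports 33 if it sees one and 64 otherwise. It assumes strict four-line records, and it can misreport Phred+33 files that happen to contain only high qualities (an empty file is also reported as 64).

```shell
# Hypothetical helper: guess the quality offset (33 or 64) of a FASTQ file.
# Heuristic: characters below ASCII 59 only occur in Phred+33 data.
guess_phred_offset() {
    awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i; min = 127 }
         NR % 4 == 0 {
             # track the smallest quality character code seen so far
             for (j = 1; j <= length($0); j++) {
                 c = ord[substr($0, j, 1)]
                 if (c < min) min = c
             }
         }
         END { print (min < 59 ? 33 : 64) }
        ' "$1"
}

# Usage (file name is a placeholder):
#   offset=$(guess_phred_offset s_1_sequence.txt)
#   echo "s_1_sequence.txt looks like Phred+$offset"
```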

                • jddavis
                  Junior Member
                  • May 2013
                  • 7

                  #9
                  quick follow-up question

                  Thank you all for the advice/suggestions. Everything has worked well throughout the pipeline and I'm finished with the Picard steps. The files are ready for realignment and base recalibration. However, I keep getting the following error:

                  Error: Unable to access jarfile ~GenomeAnalysisTK.jar

                  my command line has all the right file locations, etc.

                  It's probably an easy fix, I realize, but I would ask for advice anyhow...after this I've finished the first part of the pipeline.

                  Thanks again!

                  Here's the command line:

                  java -Xmx4g -jar ~GenomeAnalysisTK.jar -T BaseRecalibrator -I filepath/sequences/sequencedrmdp.bam -R reference_genome.fas -o recal_data.grp
                  Last edited by jddavis; 05-15-2013, 07:49 AM.

                  • mastal
                    Senior Member
                    • Mar 2009
                    • 666

                    #10
                    You're missing a '/'.

                    the location of your .jar file should be ~/GenomeAnalysisTK.jar

                    • jddavis
                      Junior Member
                      • May 2013
                      • 7

                      #11
                      Thanks, but that's not the problem...that was a typo here, but not in the command line...it seems to have trouble opening GenomeAnalysisTK.jar, or BQSR is missing? I didn't see it listed in the tool-kit directory...

                      here's the error...
                      Error: Could not find or load main class ~/GenomeAnalysisTK.jar

                      • kcchan
                        Senior Member
                        • Jul 2012
                        • 186

                        #12
                        Where did you install GATK? You have to point it to the exact place where the file is located, which may not be the home directory.
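In a script it can help to check the jar path before calling java at all. Some background on the tilde: the shell expands an unquoted `~/file` to a file in your home directory, but it reads `~file` (no slash) as the home directory of a user called `file` and leaves the text untouched when no such user exists. Tilde expansion also does not happen inside quotes, so `$HOME` or a full absolute path is usually safer in scripts. The sketch below uses a made-up helper name, `require_jar`, and the home-directory location in the usage comment is only an assumption; point it at wherever GATK was actually unpacked.

```shell
# Hypothetical helper: fail early, with a clear message, if the jar is not readable.
require_jar() {
    if [ ! -r "$1" ]; then
        printf 'Cannot read jar file: %s\n' "$1" >&2
        return 1
    fi
}

# Usage, assuming (hypothetically) the jar sits directly in the home directory:
#   GATK_JAR="$HOME/GenomeAnalysisTK.jar"
#   require_jar "$GATK_JAR" && java -Xmx4g -jar "$GATK_JAR" -T BaseRecalibrator \
#       -I filepath/sequences/sequencedrmdp.bam -R reference_genome.fas -o recal_data.grp
```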

                        • jddavis
                          Junior Member
                          • May 2013
                          • 7

                          #13
                          Thanks for the advice...I found that my download of GATK does not have the BaseRecalibrator file for some reason...the files that are in resources directory are working just fine....thanks to all for the comments.
