
  • converting s_*_sequence.txt file to Fasta

    Dear All:

    This is a simple question, but I'm new to bioinformatics and would love some help converting file types so that Google and I are not "going it alone"...

    Here is the file header/format (from a very old sequencing file):

    @BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
    TGGTGCAAAATATGAAGTCAATAAGATTAAAATAAA
    +BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
    ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZU

    All are saved as s_*_sequence.txt, where * is some number...what I'd like is to convert them to *.fasta format so I can complete the "pipeline" analysis I already set up in Linux...the software I intend to use is (in order):

    FastQC
    sickle
    BWA
    Picard
    GATK

    annotate with DAVID or ANNOVAR...

    Please help -- you will earn lots of good "juju"...

  • #2
    Your text files are in fact in FASTQ format. If you look for FASTQ to FASTA converters, you will probably find a lot.

    • #3
      There are many unix one-line solutions to this, but I often use:
      sed -n '1~4s/^@/>/p;2~4p' FILENAME.fastq > FILENAME.fasta
      Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
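For anyone scripting this step rather than calling sed: the `first~step` address form (`1~4`, `2~4`) is a GNU sed extension, so the one-liner above may not run under other sed implementations. Below is a minimal Python sketch of the same conversion. It assumes strict four-line FASTQ records (no wrapped sequence lines, no blank lines), and the function name `fastq_to_fasta` is made up for illustration.

```python
import io


def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Assumes strict 4-line records: @header, sequence, +separator, quality.
    """
    for index, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        position = index % 4
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(
                    f"line {index + 1}: expected '@' header, got {line!r}"
                )
            # Same substitution as the sed command: leading '@' becomes '>'.
            yield ">" + line[1:]
        elif position == 1:
            yield line
        # Positions 2 ('+' separator) and 3 (qualities) are dropped.


# The record from the original post, used as sample input.
record = io.StringIO(
    "@BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502\n"
    "TGGTGCAAAATATGAAGTCAATAAGATTAAAATAAA\n"
    "+BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502\n"
    "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZU\n"
)
for fasta_line in fastq_to_fasta(record):
    print(fasta_line)
```

An open file handle can be passed in place of the `io.StringIO` sample. As the later replies point out, the conversion throws away the quality scores, so it only makes sense for tools that really need FASTA.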

      • #4
        Don't convert your files. Your files are in FASTQ format, and FASTQ is the de facto standard for Next Generation Sequence data. All of the tools you have listed (at least the first group there) expect the input to be in FASTQ format. See the Wikipedia page for FASTQ.

        If any of the software complains that the files don't have the proper extension, just change .txt to .fq (or .fastq). As your files are old, they clearly are using Illumina's old (pre v1.8) Phred+64 quality encoding format (again, see the Wikipedia article). You will either have to specify that you are using phred64 FASTQ files or convert them first to phred33.
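The phred64-to-phred33 conversion mentioned above amounts to shifting each quality character down by 31 ASCII positions. Here is a minimal Python sketch, assuming Illumina 1.3 to 1.7 style Phred+64 input; the still older Solexa+64 scheme uses a different score scale and would need more than an offset shift. The function name `phred64_to_phred33` is hypothetical.

```python
def phred64_to_phred33(quality):
    """Re-encode one FASTQ quality string from a +64 offset to a +33 offset."""
    converted = []
    for char in quality:
        score = ord(char) - 64  # recover the numeric Phred score
        if score < 0:
            raise ValueError(
                f"{char!r} is below the Phred+64 range; "
                "the input may already be Phred+33"
            )
        converted.append(chr(score + 33))  # write it back with the +33 offset
    return "".join(converted)


# 'Z' (ASCII 90) is score 26, 'U' (ASCII 85) is score 21.
print(phred64_to_phred33("ZZZU"))
```

Only the fourth line of each record should go through this function; headers and sequence lines stay untouched. Many tools also accept a flag declaring phred64 input, which avoids rewriting the files at all, so check each tool's own documentation first.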

        • #5
          I'm not sure why you want to convert these to fasta. The files are already in fastq format, which has the quality scores that are absent in fasta. So you lose valuable information in doing the conversion. Furthermore, all those tools take fastq as input. In particular, if you are doing any sort of quality score trimming, then you have to have the quality scores.

          • #6
            just change the file names from .txt to .fastq and they can already be used in all of the programs you listed and almost all other programs designed for next gen sequence data. No conversion necessary.
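For an automated pipeline, the rename can be scripted. This is a minimal Python sketch, assuming the files sit in a single directory and follow the s_*_sequence.txt pattern from the original post; `rename_to_fastq` is a made-up helper name.

```python
from pathlib import Path


def rename_to_fastq(directory):
    """Rename s_*_sequence.txt files in `directory` to end in .fastq.

    Returns the list of new paths. Other .txt files are left alone.
    """
    renamed = []
    for source in sorted(Path(directory).glob("s_*_sequence.txt")):
        target = source.with_suffix(".fastq")
        if target.exists():
            # Never overwrite an existing file silently.
            raise FileExistsError(f"refusing to overwrite {target}")
        source.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_to_fastq(".")` would act on the current directory. The file contents are not changed, only the names.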

            • #7
              thank you all! The command-line suggestion was particularly helpful as I am automating the pipeline and this is my first attempt at scripting....glad it is an easy solution.

              • #8
                If you are automating this process using a pipeline, then keep in mind kmcarr's note about FASTQ quality encoding (post #4). You will need to account for data in the old (Illumina) or new (Sanger) format appropriately.
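One way a script might account for both encodings is to look at the range of quality characters before choosing a setting. The sketch below is a heuristic only, not a definitive test. It assumes that characters below ';' (ASCII 59) never occur in +64 files and that characters above 'J' (ASCII 74) are not produced by Illumina 1.8+ Phred+33 output. The function name `guess_quality_offset` is hypothetical.

```python
def guess_quality_offset(quality_strings):
    """Return 33, 64, or None (ambiguous) from observed quality characters."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 59:
        return 33  # too low for any +64 scheme
    if max(codes) > 74:
        return 64  # too high for typical Illumina 1.8+ Phred+33 output
    return None  # every character fits both ranges


# Quality characters like those in the original post ('Z' and 'U').
print(guess_quality_offset(["ZZZZZZZZU"]))
```

Feeding it the quality lines from the first few thousand reads is usually more informative than a single read. When the result is `None`, the encoding has to be confirmed some other way, for example from the sequencer or pipeline version.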

                • #9
                  quick follow-up question

                  thank you all for the advice/suggestions. Everything has worked well throughout the pipeline and I'm finished with the Picard steps. The files are ready for realignment and base recalibration. However, I keep getting the following error:

                  Error: Unable to access jarfile ~GenomeAnalysisTK.jar

                  my command line has all the right file locations, etc.

                  It's probably an easy fix, I realize, but I would ask for advice anyhow...after this I've finished the first part of the pipeline.

                  Thanks again!

                  Here's the command line:

                  java -Xmx4g -jar ~GenomeAnalysisTK.jar -T BaseRecalibrator -I filepath/sequences/sequencedrmdp.bam -R reference_genome.fas -o recal_data.grp
                  Last edited by jddavis; 05-15-2013, 07:49 AM.

                  • #10
                    You're missing a '/'.

                    the location of your .jar file should be ~/GenomeAnalysisTK.jar

                    • #11
                      thanks, but that's not the problem...that was a typo in here, but not in the command line...it seems to have trouble opening GenomeAnalysisTK.jar, or BQSR is missing? I didn't see it listed in the tool-kit directory...

                      here's the error...
                      Error: Could not find or load main class ~/GenomeAnalysisTK.jar

                      • #12
                        Where did you install GATK? You have to point it to the exact place where the file is located, which may not be the home directory.
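In a scripted pipeline it can help to check the jar path before launching java. Some background: the shell expands `~/name` to a file in your home directory, whereas `~name` means the home directory of a user called `name` and is left as literal text when no such user exists. The "Could not find or load main class" message is what java prints when it treats its argument as a class name, which usually means the `-jar` flag did not reach it. The Python sketch below only covers the path check; `resolve_jar` is a made-up helper name.

```python
import os


def resolve_jar(path):
    """Expand a leading ~ and confirm the jar exists before handing it to java."""
    expanded = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(expanded):
        raise FileNotFoundError(
            f"no jar file at {expanded} (given as {path!r})"
        )
    return expanded
```

A pipeline script could then build its command as a list, for example `["java", "-Xmx4g", "-jar", resolve_jar("~/GenomeAnalysisTK.jar"), "-T", "BaseRecalibrator"]` followed by the remaining arguments from post #9, and pass it to `subprocess.run`. The jar location shown here is only the one suggested in post #10; the real one depends on where GATK was unpacked.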

                        • #13
                          Thanks for the advice...I found that my download of GATK does not have the BaseRecalibrator file for some reason...the files that are in the resources directory are working just fine....thanks to all for the comments.
