**kbradnam** · 05-13-2013, 08:37 AM

Your text files are in fact in FASTQ format. If you look for FASTQ to FASTA converters, you will probably find a lot.

**SNPsaurus** · 05-13-2013, 08:37 AM

There are many unix one-line solutions to this, but I often use:
sed -n '1~4s/^@/>/p;2~4p' FILENAME.fastq > FILENAME.fasta
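Note that the `1~4` address form is a GNU sed extension, so the one-liner above may not work with the sed shipped on BSD or macOS. A rough portable sketch with awk is below. The function name `fastq_to_fasta` is made up for illustration, and it assumes strict 4-line FASTQ records (no wrapped sequence lines).

```shell
# Hypothetical helper: convert 4-line FASTQ records to FASTA.
# Assumes every record is exactly 4 lines (header, sequence, '+', quality).
fastq_to_fasta() {
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }  # header: swap leading @ for >
       NR % 4 == 2 { print }                    # sequence line, unchanged
      ' "$@"
}

# Example use (file names are placeholders):
# fastq_to_fasta FILENAME.fastq > FILENAME.fasta
```

Because it counts lines instead of matching on `@`, a quality line that happens to start with `@` is not mistaken for a header.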

**kmcarr** · 05-13-2013, 09:12 AM

Originally posted by jddavis

Dear All:

This is a simple question, but I'm new to bioinformatics and would love to have some help converting file types so that google and I are not "going it alone"...

Here is the file header/format (from a very old sequencing file):

@BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
TGGTGCAAAATATGAAGTCAATAAGATTAAAATAAA
+BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZU

all are saved as s_*_sequence.txt where *=some number...what I'd like is to convert them to *.fasta format so I can complete my "pipeline" analysis I already set up in Linux...the software I intend to use is (in order):

FastQC
sickle
BWA
Picard
GATK

annotate with DAVID or ANNOVAR...

Please help -- you will earn lots of good "juju"...

Don't convert your files. Your files are in FASTQ format and FASTQ is the de facto standard for Next Generation Sequence data. All of the tools you have listed (at least the first group there) expect the input to be in FASTQ format. See the Wikipedia page for FASTQ.

If any of the software complain that the files don't have the proper extension just change .txt to .fq (or .fastq). As your files are old they clearly are using Illumina's old (pre v1.8) Phred+64 quality encoding format (again, see the Wikipedia article). You will either have to specify that you are using phred64 FASTQ files or convert them first to phred33.
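If you go the conversion route, the idea is just to shift every quality character down by 31 (64 minus 33) on each fourth line. Here is a minimal sketch, not a tested production tool. The function name `phred64_to_phred33` is invented for illustration, and it assumes strict 4-line records whose quality characters are all `@` (ASCII 64) or higher.

```shell
# Hypothetical helper: re-encode Phred+64 qualities as Phred+33.
# Assumes 4-line FASTQ records and quality characters with ASCII >= 64.
phred64_to_phred33() {
  awk 'BEGIN {
         # Build a character -> ASCII code lookup for printable characters.
         for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
       }
       NR % 4 == 0 {
         out = ""
         for (j = 1; j <= length($0); j++) {
           # Shift each quality character down by 64 - 33 = 31.
           out = out sprintf("%c", ord[substr($0, j, 1)] - 31)
         }
         print out
         next
       }
       { print }  # header, sequence and "+" lines pass through unchanged
      ' "$@"
}

# Example use (file names are placeholders):
# phred64_to_phred33 s_1_sequence.txt > s_1_sequence.phred33.fastq
```

With this shift, `Z` becomes `;` and `U` becomes `6`, so the Phred score each character stands for stays the same.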

**chadn737** · 05-13-2013, 09:13 AM

I'm not sure why you want to convert these to fasta. The files are already in fastq format, which has the quality scores that are absent in fasta. So you lose valuable information in doing the conversion. Furthermore, all those tools take fastq as input. In particular, if you are doing any sort of quality score trimming, then you have to have the quality scores.

**Jeremy** · 05-13-2013, 07:20 PM

just change the file names from .txt to .fastq and they can already be used in all of the programs you listed and almost all other programs designed for next gen sequence data. No conversion necessary.
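For a whole directory of files, the rename can be scripted. This is only a sketch: the function name `rename_to_fastq` is made up, and it assumes the files follow the s_*_sequence.txt pattern from the original question.

```shell
# Hypothetical helper: rename s_*_sequence.txt to s_*_sequence.fastq
# in the given directory (defaults to the current directory).
rename_to_fastq() {
  dir=${1:-.}
  for f in "$dir"/s_*_sequence.txt; do
    [ -e "$f" ] || continue          # skip if the pattern matched nothing
    mv -- "$f" "${f%.txt}.fastq"     # strip .txt, append .fastq
  done
}

# Example use:
# rename_to_fastq /path/to/reads
```

Only the extension changes; the file contents are left untouched.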

**jddavis** · 05-14-2013, 06:49 AM

thank you all! The command-line suggestion was particularly helpful as I am automating the pipeline and this is my first attempt at scripting....glad it is an easy solution.

**GenoMax** · 05-14-2013, 08:05 AM

If you are automating this process using a pipeline then keep in mind kmcarr's note about FASTQ quality encoding (post #6). You will need to account for data in old (illumina)/new (sanger) format appropriately.
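One way a pipeline could account for this is to guess the encoding from the lowest quality character in each file before choosing options. The sketch below is a heuristic only, with an invented function name `guess_phred_offset`: characters below ASCII 59 can only occur in Phred+33 data, while a minimum of 64 or above merely suggests Phred+64 (a Phred+33 file containing only high-quality bases would look the same).

```shell
# Hypothetical helper: guess the quality offset of a 4-line-record FASTQ file.
# Prints "phred33", "phred64" or "unknown". This is a heuristic, not a proof.
guess_phred_offset() {
  awk 'BEGIN {
         for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
         min = 127
       }
       NR % 4 == 0 {
         for (j = 1; j <= length($0); j++) {
           ch = substr($0, j, 1)
           # Ignore anything outside the printable range (e.g. stray CR).
           if ((ch in ord) && ord[ch] < min) min = ord[ch]
         }
       }
       END {
         if (min == 127)    print "unknown"   # no quality characters seen
         else if (min < 59) print "phred33"   # too low for any +64 encoding
         else if (min >= 64) print "phred64"  # consistent with Illumina 1.3+
         else               print "unknown"   # 59-63: ambiguous (e.g. Solexa)
       }' "$@"
}

# Example use (file name is a placeholder):
# guess_phred_offset s_1_sequence.txt
```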

**jddavis** · 05-15-2013, 07:47 AM

quick follow-up question

thank you all for the advice/suggestions. Everything has worked well throughout the pipeline and I'm finished with the Picard steps. The files are ready for realignment and base recalibration. However, I keep getting the following error:

Error: Unable to access jarfile ~GenomeAnalysisTK.jar

my command line has all the right file locations, etc.

It's probably an easy fix, I realize, but I would ask for advice anyhow...after this I've finished the first part of the pipeline.

Thanks again!

Here's the command line:

java -Xmx4g -jar ~GenomeAnalysisTK.jar -T BaseRecalibrator -I filepath/sequences/sequencedrmdp.bam -R reference_genome.fas -o recal_data.grp

**mastal** · 05-15-2013, 08:13 AM

You're missing a '/'.

the location of your .jar file should be ~/GenomeAnalysisTK.jar

**jddavis** · 05-15-2013, 08:16 AM

thanks, but that's not the problem...that was a typo in here, but not in the command line...it seems to have trouble opening GenomeAnalysisTK.jar, or BQSR is missing? I didn't see it listed in the tool-kit directory...

here's the error...
Error: Could not find or load main class ~/GenomeAnalysisTK.jar

**kcchan** · 05-15-2013, 12:43 PM

Where did you install GATK? You have to point it to the exact place where the file is located, which may not be the home directory.
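A quick way to check is to search for the jar and confirm it is readable before calling java. The sketch below uses an invented function name, `find_gatk_jar`, and assumes the jar sits somewhere under the directory you pass in. As a side note, the shell treats `~GenomeAnalysisTK.jar` (no slash) as the home directory of a user with that name, so it is passed to java unexpanded.

```shell
# Hypothetical helper: print the first GenomeAnalysisTK.jar found under a
# directory (defaults to $HOME). Prints nothing if no jar is found.
find_gatk_jar() {
  search_root=${1:-$HOME}
  find "$search_root" -type f -name 'GenomeAnalysisTK.jar' 2>/dev/null | head -n 1
}

# Example use, with the arguments from the command line posted above:
# jar=$(find_gatk_jar "$HOME")
# if [ -r "$jar" ]; then
#   java -Xmx4g -jar "$jar" -T BaseRecalibrator \
#     -I filepath/sequences/sequencedrmdp.bam -R reference_genome.fas -o recal_data.grp
# else
#   echo "GenomeAnalysisTK.jar not found or not readable" >&2
# fi
```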

**jddavis** · 05-15-2013, 01:21 PM

Thanks for the advice...I found that my download of GATK does not have the BaseRecalibrator file for some reason...the files that are in resources directory are working just fine....thanks to all for the comments.

converting s_*_sequence.txt file to Fasta
