SEQanswers

Old 05-13-2013, 08:11 AM   #1
jddavis
Junior Member
 
Location: Texas

Join Date: May 2013
Posts: 7
Default converting s_*_sequence.txt file to Fasta

Dear All:

This is a simple question, but I'm new to bioinformatics and would love to have some help converting file types so that google and I are not "going it alone"...

Here is the file header/format (from a very old sequencing file):

@BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
TGGTGCAAAATATGAAGTCAATAAGATTAAAATAAA
+BRITNEYSPEARS_1_FC203R3AAXX:2:1:208:502
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZU

All are saved as s_*_sequence.txt, where * is some number. What I'd like is to convert them to *.fasta format so I can complete the "pipeline" analysis I have already set up in Linux. The software tools I intend to use are (in order):

FastQC
sickle
BWA
Picard
GATK

annotate with DAVID or ANNOVAR...

Please help -- you will earn lots of good "juju"...
Old 05-13-2013, 08:37 AM   #2
kbradnam
Member
 
Location: Davis, CA

Join Date: May 2011
Posts: 53
Default

Your text files are in fact in FASTQ format. If you look for FASTQ to FASTA converters, you will probably find a lot.
Old 05-13-2013, 08:37 AM   #3
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 501
Default

There are many unix one-line solutions to this, but I often use:
sed -n '1~4s/^@/>/p;2~4p' FILENAME.fastq > FILENAME.fasta
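One caveat on the sed version: the `1~4` step address is a GNU sed extension, so it may not work with other sed implementations (for example the BSD sed on macOS). A rough awk sketch of the same idea is below. It assumes well-formed FASTQ with exactly four lines per record, and the function name `fastq_to_fasta` is made up for illustration. Like the sed command, it throws the quality line away.

```shell
# Hypothetical helper: print one FASTA record for every 4-line FASTQ record.
# Assumes each record is exactly four lines (no wrapped sequences).
fastq_to_fasta() {
  awk '
    NR % 4 == 1 { print ">" substr($0, 2) }  # header line: swap leading "@" for ">"
    NR % 4 == 2 { print }                    # sequence line, unchanged
  ' "$@"
}

# Example: fastq_to_fasta s_1_sequence.txt > s_1_sequence.fasta
```

With no file argument the function reads standard input, so it can also sit in the middle of a pipe.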
Old 05-13-2013, 09:12 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169
Default

Don't convert your files. Your files are in FASTQ format, and FASTQ is the de facto standard for Next Generation Sequence data. All of the tools you have listed (at least the first group there) expect the input to be in FASTQ format. See the Wikipedia page for FASTQ.

If any of the software complains that the files don't have the proper extension, just change .txt to .fq (or .fastq). As your files are old, they are clearly using Illumina's old (pre-v1.8) Phred+64 quality encoding (again, see the Wikipedia article). You will either have to tell each tool that you are using phred64 FASTQ files or convert them to phred33 first.
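For the "convert them first" route, here is a minimal sketch that shifts every quality character down by 31 (64 minus 33) on the fourth line of each record. It assumes plain Phred+64 data in four-line records; the older Solexa-scaled +64 encoding uses a different score formula and is not handled here. The function name `phred64_to_phred33` is hypothetical.

```shell
# Hypothetical helper: re-encode Phred+64 qualities as Phred+33.
# Assumes 4-line FASTQ records and true Phred+64 (not Solexa-scaled) scores.
phred64_to_phred33() {
  awk '
    BEGIN {
      # Lookup table: ASCII 64..126 maps to ASCII 33..95 (same score, offset 33).
      for (i = 64; i <= 126; i++) map[sprintf("%c", i)] = sprintf("%c", i - 31)
    }
    NR % 4 == 0 {
      out = ""
      n = length($0)
      for (j = 1; j <= n; j++) {
        c = substr($0, j, 1)
        # Characters outside the Phred+64 range are passed through untouched.
        out = out ((c in map) ? map[c] : c)
      }
      print out
      next
    }
    { print }   # header, sequence and "+" lines are copied as-is
  ' "$@"
}

# Example: phred64_to_phred33 s_1_sequence.txt > s_1_sequence.phred33.fastq
```

On the example read from the first post, the quality string of Z's ending in U would come out as a run of `;` ending in `6`.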
Old 05-13-2013, 09:13 AM   #5
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

I'm not sure why you want to convert these to fasta. The files are already in fastq format, which includes the quality scores that are absent in fasta, so you lose valuable information in doing the conversion. Furthermore, all those tools take fastq as input. In particular, if you are doing any sort of quality score trimming, then you have to have the quality scores.
Old 05-13-2013, 07:20 PM   #6
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

Just change the file names from .txt to .fastq and they can already be used in all of the programs you listed and almost all other programs designed for next gen sequence data. No conversion necessary.
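If there are a lot of files, the rename can be scripted. A small sketch, assuming the files all match s_*_sequence.txt in one directory (the function name `rename_to_fastq` is invented for this example):

```shell
# Hypothetical helper: rename s_*_sequence.txt to s_*_sequence.fastq.
# Takes the directory as its first argument (defaults to the current one).
rename_to_fastq() {
  dir=${1:-.}
  for f in "$dir"/s_*_sequence.txt; do
    [ -e "$f" ] || continue            # the glob matched nothing
    mv -- "$f" "${f%.txt}.fastq"       # strip ".txt", append ".fastq"
  done
}

# Example: rename_to_fastq /path/to/run_folder
```

Only the file names change; the contents are left as they are.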
Old 05-14-2013, 06:49 AM   #7
jddavis
Junior Member
 
Location: Texas

Join Date: May 2013
Posts: 7
Default

thank you all! The command-line suggestion was particularly helpful as I am automating the pipeline and this is my first attempt at scripting....glad it is an easy solution.
Old 05-14-2013, 08:05 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

If you are automating this process using a pipeline, then keep in mind kmcarr's note about FASTQ quality encoding (post #4). You will need to account for data in old (Illumina) and new (Sanger) format appropriately.
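One way a pipeline could account for this is to guess the encoding from the lowest quality character it sees. The sketch below is only a heuristic: characters below ASCII 59 can only occur in Phred+33, but a Phred+33 file that happens to contain only high-quality bases would be misreported as phred64, so treat the answer as a hint. The function name `guess_fastq_encoding` and the labels it prints are made up for this example.

```shell
# Hypothetical helper: guess the quality encoding of a 4-line-per-record FASTQ.
# Heuristic only; prints one of: phred33, solexa64, phred64, unknown.
guess_fastq_encoding() {
  awk '
    BEGIN {
      # Build a character-to-code table for printable ASCII.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
      min = 127
    }
    NR % 4 == 0 {
      # Track the smallest character code seen on any quality line.
      n = length($0)
      for (j = 1; j <= n; j++) {
        c = substr($0, j, 1)
        if ((c in ord) && ord[c] < min) min = ord[c]
      }
    }
    END {
      if (min == 127)     print "unknown"    # no quality characters seen
      else if (min < 59)  print "phred33"    # too low for any +64 encoding
      else if (min < 64)  print "solexa64"   # below 64 but within Solexa range
      else                print "phred64"    # consistent with Phred+64
    }
  ' "$@"
}

# Example: guess_fastq_encoding s_1_sequence.txt
```

Scanning only the first few thousand reads (for example by piping `head` into it) is usually enough and much faster on large files.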
Old 05-15-2013, 07:47 AM   #9
jddavis
Junior Member
 
Location: Texas

Join Date: May 2013
Posts: 7
Default quick follow-up question

Thank you all for the advice/suggestions. Everything has worked well throughout the pipeline and I'm finished with the Picard steps. The files are ready for realignment and base recalibration. However, I keep getting the following error:

Error: Unable to access jarfile ~GenomeAnalysisTK.jar

my command line has all the right file locations, etc.

It's probably an easy fix, I realize, but I would ask for advice anyhow...after this I've finished the first part of the pipeline.

Thanks again!

Here's the command line:

java -Xmx4g -jar ~GenomeAnalysisTK.jar -T BaseRecalibrator -I filepath/sequences/sequencedrmdp.bam -R reference_genome.fas -o recal_data.grp

Last edited by jddavis; 05-15-2013 at 07:49 AM.
Old 05-15-2013, 08:13 AM   #10
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

You're missing a '/'.

the location of your .jar file should be ~/GenomeAnalysisTK.jar
Old 05-15-2013, 08:16 AM   #11
jddavis
Junior Member
 
Location: Texas

Join Date: May 2013
Posts: 7
Default

Thanks, but that's not the problem...that was a typo in here, but not in the command line...it seems to have trouble opening GenomeAnalysisTK.jar, or BQSR is missing? I didn't see it listed in the tool-kit directory...

here's the error...
Error: Could not find or load main class ~/GenomeAnalysisTK.jar
Old 05-15-2013, 12:43 PM   #12
kcchan
Senior Member
 
Location: USA

Join Date: Jul 2012
Posts: 182
Default

Where did you install GATK? You have to point it to the exact place where the file is located, which may not be the home directory.
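A quick way to rule out a path problem is to test the path before handing it to java. Note that an unquoted `~name` is read by the shell as "home directory of the user called name", so `~GenomeAnalysisTK.jar` stays a literal string unless such a user exists; `~/GenomeAnalysisTK.jar` or `$HOME/GenomeAnalysisTK.jar` is what points into your own home directory. A minimal sketch follows; the function name `check_jar` is made up, and the jar location in the example is an assumption.

```shell
# Hypothetical helper: confirm that a jar path is a readable regular file.
check_jar() {
  jar=$1
  if [ -f "$jar" ] && [ -r "$jar" ]; then
    echo "OK: $jar"
  else
    echo "Cannot read jar file: $jar" >&2
    return 1
  fi
}

# Example (the jar location is an assumption; use wherever GATK was unpacked):
# GATK_JAR="$HOME/GenomeAnalysisTK.jar"
# check_jar "$GATK_JAR" && java -Xmx4g -jar "$GATK_JAR" -T BaseRecalibrator \
#   -I filepath/sequences/sequencedrmdp.bam -R reference_genome.fas -o recal_data.grp
```

If the check fails, something like `find "$HOME" -name GenomeAnalysisTK.jar` should show where the file actually ended up.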
Old 05-15-2013, 01:21 PM   #13
jddavis
Junior Member
 
Location: Texas

Join Date: May 2013
Posts: 7
Default

Thanks for the advice...I found that my download of GATK does not have the BaseRecalibrator file for some reason...the files that are in resources directory are working just fine....thanks to all for the comments.