SEQanswers

01-11-2011, 01:24 PM   #1
Airwalker810
Junior Member
 
Location: East Coast, US

Join Date: Oct 2010
Posts: 6
Help with FastQ/CASAVA format problems

Hey all, newbie here. I'm not sure if this is the appropriate place to post this, but I was wondering if I could get some help with an issue involving Illumina deep-sequencing data. I'm trying to run a batch of recently received deep-seq data through CASAVA v1.7 and align it to a genome. The file is in FASTQ format and the reads look like this:



@6:1:1410:944:N
NNNNCAAACACAAAGTTACCTAAACTATAGAAGTCAAACA
+
####&&()''@@@@@@8@@@31888@@@@@3885817775



However, when I try to run it through the program, it gives the following error:



Could not identify index of the following line:
*********************************
6:1:1410:944:N
*********************************

Please check your files, we expect the following syntax:
<machine-id>_<run-number>(flow_cell-id):lane:tile:x:y#<index>:<pair>
machine-id: all characters except '_'



I realize this is a formatting issue, as CASAVA wants the read IDs in this format:



@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read_#>



But I'm unsure how to go about fixing it. I'm pretty sure the machine_id is missing, as well as any index and read-number information. Any help would be much appreciated. Thanks!
01-11-2011, 02:23 PM   #2
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82

You could make a small script to chug through the file and add the machine id field (either the real one if you can acquire it, or else a made-up placeholder).

Regarding the "#<index>:<pair>" fields, some more info on the experiment might be needed. Is this single-end or paired-end, and how many data files are there? (Illumina paired-end data usually comes in paired files, with each read pair positioned on corresponding lines in the two files.) Any multiplexing?
01-12-2011, 06:54 AM   #3
Airwalker810
Junior Member
 
Location: East Coast, US

Join Date: Oct 2010
Posts: 6

It is not paired-end, and I'm almost certain there is no multiplexing at all in the sample. A sample input would be a great help. Thanks for the assistance!

Last edited by Airwalker810; 01-12-2011 at 06:57 AM.
01-12-2011, 09:02 AM   #4
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82

If it's single-end and no multiplexing, then you have all the information you need and it should just be a matter of formatting the ID line to make your program happy. The program is expecting read ID lines to look like this:

@ILxx_1234:1:1:1103:6172#1/1
@ILxx_1234:1:1:1103:16929#7/1
@ILxx_1234:1:1:1103:13497#2/2

where the first field is the ID/name of the machine that performed the experiment followed by the run number, the number after the "#" is the sample ID (if there are multiple samples) and the number after the "/" is the pair info for paired-end experiments (so it's either 1 or 2). If the program really wants a machine name, I guess you could just make up a phony machine name (ILmymachine_0001 or something more clever or whatever) for the first field. And since you have only a single sample, if the program really wants an index I guess you could just add "#1" after the y-coordinate (removing the ":N" part - I'm not sure what it signifies). For the pair-info, my guess is that you can just leave that info out (i.e. simply skip the "/1" part) and the program will treat the data as single-end.
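
To make that layout concrete, here is a minimal Python sketch that splits an ID line of the shape above into its fields. The regular expression and the `parse_id` name are only an illustration, not anything taken from CASAVA, and treating the "/<pair>" suffix as optional is an assumption based on the guess above.

```python
import re

# Assumed target layout: @<machine>_<run>:<lane>:<tile>:<x>:<y>#<index>[/<pair>]
# The optional "/<pair>" part is a guess, not documented CASAVA behaviour.
TARGET_ID = re.compile(
    r"^@(?P<machine>[^_:]+)_(?P<run>\d+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>\w+)(?:/(?P<pair>[12]))?$"
)

def parse_id(line):
    """Return the ID fields as a dict, or None if the line does not match."""
    match = TARGET_ID.match(line.strip())
    return match.groupdict() if match else None
```

Calling `parse_id` on one of the example IDs above returns a dict of the named fields, while an ID like `@6:1:1410:944:N` returns `None` because it has no `<machine>_<run>` prefix and no `#<index>`.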

(NOTE: I don't know anything about CASAVA - as I understand things it is Illumina's own program that can do a bunch of stuff. It's not inconceivable that CASAVA itself can generate the correct ID lines from lower level files - but again I don't know much about the pre-fastq pipeline.)

If you know a little Perl or Python scripting you should be able to make those changes to the ID lines to make CASAVA accept them. However, this is just a quick-and-dirty practical fix; I don't know the underlying reason why your read ID lines look the way they do (maybe whoever generated the files does).
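
A minimal sketch of such a script in Python, assuming plain four-line FASTQ records (no wrapped sequences) and input IDs shaped like `@6:1:1410:944:N`, read here as lane:tile:x:y plus a trailing flag that gets dropped. The machine name `ILmymachine_0001`, the index `1`, and the function names are made-up placeholders, and whether CASAVA accepts the result has not been checked.

```python
import re

# Assumed input ID shape: @<lane>:<tile>:<x>:<y>[:<flag>], e.g. @6:1:1410:944:N
OLD_ID = re.compile(r"^@(\d+):(\d+):(\d+):(\d+)(?::\w+)?$")

def fix_header(line, machine="ILmymachine_0001", index="1"):
    """Rewrite one ID line, adding a placeholder machine name and index."""
    match = OLD_ID.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"unexpected ID line: {line!r}")
    lane, tile, x, y = match.groups()
    # The trailing ":N"-style field is dropped; no "/<pair>" is added (single-end).
    return f"@{machine}:{lane}:{tile}:{x}:{y}#{index}"

def fix_fastq(lines, **kwargs):
    """Yield FASTQ lines with every 4th line (the ID line) rewritten."""
    for number, line in enumerate(lines):
        line = line.rstrip("\n")
        # Count lines rather than test for "@": quality strings can start with "@".
        yield fix_header(line, **kwargs) if number % 4 == 0 else line
```

To convert a file, iterate `fix_fastq` over the open input file and write each yielded line, followed by a newline, to the output file. Pass `machine=` and `index=` if you have the real values.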
01-12-2011, 09:20 AM   #5
Airwalker810
Junior Member
 
Location: East Coast, US

Join Date: Oct 2010
Posts: 6

Thanks for the help; a little scripting should make things a bit easier. I'm not sure what the deal with this data is. It was outsourced, and it came back looking like this mess, and I have no idea why specific fields are missing from the ID lines. My lab just procured a deep-sequencing machine, and I'm trying to force the data through that pipeline so that past and future data all work with the same analysis program.