SEQanswers

Old 01-05-2011, 12:07 PM   #1
sbberes
Member
 
Location: Houston TX

Join Date: Jan 2009
Posts: 22
Anybody recognize this Illumina data format?

Dear community,
I just obtained some Illumina sequencing data, but it is in a format that I am unfamiliar with. The files were labeled "probs.txt". Does anybody recognize this format, and can you suggest software to parse/convert it to fastq? From the header info it appears to be paired-end Illumina data with base calls, but I am not sure whether the base calls are followed by quality values or intensities. The numbers following the base calls come in sets of 4, and the order of the numbers corresponds to ACGT: if the first of the four numbers is the highest then the base is an A, if the 2nd number of a set of 4 is the highest then the base is a C, etc.

3 1 10 1097#0/1 ATCTA........CCTGGCCACC............. 37 -37 -40 -40 -40 -9 -40 9 -22 17 -25 -20 -0 -40 -34 0 -2 -2 -13 -6 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -13 6 -14 -10 -24 12 -21 -13 -39 -21 -19 17 -40 -4 1 -7 -15 -11 2 -3 -13 8 -16 -12 -11 8 -23 -12 12 -14 -18 -20 -13 6 -12 -10 -15 10 -16 -13 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
3 1 10 1097#0/2 CAGACATCGCGATCGGGTTCGCGATCCGC.CCGAAG -18 16 -33 -21 39 -39 -40 -40 -39 -8 6 -12 34 -38 -37 -40 -30 25 -27 -33 31 -31 -40 -40 -28 -8 -20 8 -17 13 -19 -20 -40 -26 17 -18 -8 5 -16 -10 -40 -10 8 -13 21 -23 -40 -28 -26 -13 -21 12 -9 3 -19 -6 -40 -22 16 -17 -40 -4 -3 -3 -40 -17 16 -26 -30 -13 -16 11 -40 -18 -36 18 -4 -4 -5 -6 -40 -24 15 -16 -24 19 -24 -21 -27 -11 10 -16 21 -24 -32 -24 -24 -15 -19 13 -19 12 -19 -14 -9 6 -13 -14 -16 -8 3 -7 -18 0 -8 -2 -5 -5 -5 -5 -9 1 -4 -12 -14 3 -11 -6 -16 -11 7 -10 14 -14 -34 -22 11 -15 -14 -28 -14 -6 -1 -4
Old 01-05-2011, 01:55 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,170

Quote:
Originally Posted by sbberes
Dear community,
I just obtained some Illumina sequencing data, but it is in a format that I am unfamiliar with. The files were labeled "probs.txt". Does anybody recognize this format, and can you suggest software to parse/convert it to fastq? From the header info it appears to be paired-end Illumina data with base calls, but I am not sure whether the base calls are followed by quality values or intensities. The numbers following the base calls come in sets of 4, and the order of the numbers corresponds to ACGT: if the first of the four numbers is the highest then the base is an A, if the 2nd number of a set of 4 is the highest then the base is a C, etc.
Code:
3	1	10	1097#0/1	ATCTA........CCTGGCCACC.............	  37  -37  -40  -40	 -40   -9  -40    9	 -22   17  -25  -20	  -0  -40  -34    0	  -2   -2  -13   -6	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	 -13    6  -14  -10	 -24   12  -21  -13	 -39  -21  -19   17	 -40   -4    1   -7	 -15  -11    2   -3	 -13    8  -16  -12	 -11    8  -23  -12	  12  -14  -18  -20	 -13    6  -12  -10	 -15   10  -16  -13	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5
3	1	10	1097#0/2	CAGACATCGCGATCGGGTTCGCGATCCGC.CCGAAG	 -18   16  -33  -21	  39  -39  -40  -40	 -39   -8    6  -12	  34  -38  -37  -40	 -30   25  -27  -33	  31  -31  -40  -40	 -28   -8  -20    8	 -17   13  -19  -20	 -40  -26   17  -18	  -8    5  -16  -10	 -40  -10    8  -13	  21  -23  -40  -28	 -26  -13  -21   12	  -9    3  -19   -6	 -40  -22   16  -17	 -40   -4   -3   -3	 -40  -17   16  -26	 -30  -13  -16   11	 -40  -18  -36   18	  -4   -4   -5   -6	 -40  -24   15  -16	 -24   19  -24  -21	 -27  -11   10  -16	  21  -24  -32  -24	 -24  -15  -19   13	 -19   12  -19  -14	  -9    6  -13  -14	 -16   -8    3   -7	 -18    0   -8   -2	  -5   -5   -5   -5	  -9    1   -4  -12	 -14    3  -11   -6	 -16  -11    7  -10	  14  -14  -34  -22	  11  -15  -14  -28	 -14   -6   -1   -4
It appears that someone has shmushed together the old-style Illumina _seq.txt and _prb.txt files, and has also merged the read1 and read2 files from a paired-end run. With your file wrapped in CODE tags (above), you can see more clearly how it is laid out.

- Column 1 identifies the lane, column 2 the tile, column 3 the X-coordinate and column 4 (up to the #) the Y-coordinate. Together they make up a unique identifier for the cluster.

- The "#0" is used to identify multiplex IDs (meaningless in this case).

- The number after the / (1 or 2) indicates whether this is read1 or read2 for this particular cluster.

- Column 5 is the sequence.

- The remaining numbers are quality scores, arranged in groups of 4, one group per position in the sequence. The four values in a group correspond to the probability of each possible base at that position. As you correctly surmised, they are in the order ACGT. Each group of 4 numbers is separated from the next by a tab, and within each group the numbers are separated by one or more spaces.
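To make the column layout above concrete, here is a minimal Python sketch that parses one line of the merged file. It assumes the fields are tab-separated exactly as in the CODE block above; the names `ProbRecord` and `parse_probs_line` are made up for this illustration and are not part of any Illumina or MAQ tool.

```python
from typing import List, NamedTuple, Tuple


class ProbRecord(NamedTuple):
    """One line of the merged file (hypothetical container for this sketch)."""
    lane: int
    tile: int
    x: int
    y: int
    index: str          # text between '#' and '/', e.g. "0"
    read: int           # 1 or 2
    sequence: str
    scores: List[Tuple[int, ...]]  # one (A, C, G, T) group per base


def parse_probs_line(line: str) -> ProbRecord:
    # Assumption: columns are separated by tabs, as in the CODE block above.
    fields = line.rstrip("\n").split("\t")
    lane, tile, x, y_field, sequence = fields[:5]

    # Column 4 looks like "1097#0/1": Y-coordinate, multiplex index, read number.
    y, _, rest = y_field.partition("#")
    index, _, read = rest.partition("/")

    # Each remaining tab-separated field holds four space-padded integers (A C G T).
    scores = [tuple(int(v) for v in group.split())
              for group in fields[5:] if group.strip()]

    if any(len(group) != 4 for group in scores):
        raise ValueError("expected 4 scores (A, C, G, T) per position")
    if len(scores) != len(sequence):
        raise ValueError("number of score groups does not match sequence length")

    return ProbRecord(int(lane), int(tile), int(x), int(y),
                      index, int(read), sequence, scores)
```

The two length checks are there only as a sanity test that a line really has one group of four scores per base; a line that fails them was probably not split on the right delimiter.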

I would recommend splitting this file back up into its original form, and also separating the records by read. Columns 1-5 (identifier and sequence) should be put in a file named readN_seq.txt (where N is 1 or 2 depending on the read). All of the probability scores should be put in files named readN_prb.txt. There should be one line in each file for each sequence, and you must maintain the proper order: there is no identifying information in the _prb.txt files, and the following step assumes they are matched line by line with the _seq.txt files.
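The split described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the input is tab-separated as in the CODE block, the read number is whatever follows the "/" in column 4, and column 4 is copied unchanged into the _seq.txt file. The function name `split_probs` is hypothetical.

```python
from pathlib import Path


def split_probs(merged_path, out_dir):
    """Split a merged probs.txt into readN_seq.txt / readN_prb.txt files.

    Returns a dict mapping read number ("1" or "2") to the number of lines written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}   # read number -> (seq file handle, prb file handle)
    counts = {}
    try:
        with open(merged_path) as merged:
            for line in merged:
                line = line.rstrip("\n")
                if not line.strip():
                    continue  # skip blank lines
                fields = line.split("\t")
                # Column 4 ends in "/1" or "/2", which identifies the read.
                read = fields[3].rsplit("/", 1)[-1]
                if read not in ("1", "2"):
                    raise ValueError(f"cannot determine read number from {fields[3]!r}")
                if read not in handles:
                    handles[read] = (
                        open(out_dir / f"read{read}_seq.txt", "w"),
                        open(out_dir / f"read{read}_prb.txt", "w"),
                    )
                    counts[read] = 0
                seq_out, prb_out = handles[read]
                # Columns 1-5 go to _seq.txt; all score groups go to _prb.txt.
                # Both files are written in input order so they stay matched line by line.
                seq_out.write("\t".join(fields[:5]) + "\n")
                prb_out.write("\t".join(fields[5:]) + "\n")
                counts[read] += 1
    finally:
        for pair in handles.values():
            for handle in pair:
                handle.close()
    return counts
```

Calling `split_probs("probs.txt", ".")` would write the four files into the current directory. Because each input line is written to its _seq.txt and _prb.txt file in the same pass, the line-by-line pairing is preserved.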

Use the fq_all2std.pl script which is provided as part of the MAQ package to convert these two files to FASTQ. The command to use for this input is seqprb2std. Sample command:

Code:
$ fq_all2std.pl seqprb2std read1_seq.txt read1_prb.txt > read1.fastq
Do likewise for read2 and you should have two properly formatted FASTQ files.
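If MAQ is not at hand, the idea behind the conversion can be sketched directly in Python. This is not a reimplementation of fq_all2std.pl and its output may differ from that script. It rests on several assumptions: the scores are Solexa-scaled, the quality of a base is the highest of its four scores, "." is written as "N", qualities are encoded Sanger-style (Phred + 33), and the read name is simply columns 1-4 joined by underscores. The function names are hypothetical.

```python
import math


def solexa_to_phred(q: int) -> int:
    # Standard Solexa-to-Phred mapping: Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)
    return int(round(10 * math.log10(10 ** (q / 10) + 1)))


def fastq_record(line: str) -> str:
    """Turn one tab-separated line of the merged file into a FASTQ record."""
    fields = line.rstrip("\n").split("\t")

    # Assumption: read name built from lane, tile, X and Y (column 4 kept as-is).
    name = "_".join(fields[:4])

    # Assumption: uncalled bases ('.') are written as 'N'.
    seq = fields[4].replace(".", "N")

    # Assumption: the per-base quality is the highest of the four A/C/G/T scores.
    best = [max(int(v) for v in group.split())
            for group in fields[5:] if group.strip()]
    if len(best) != len(seq):
        raise ValueError("number of score groups does not match sequence length")

    # Sanger-style FASTQ encoding: ASCII character with code Phred + 33.
    qual = "".join(chr(33 + solexa_to_phred(q)) for q in best)
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Running `fastq_record` over every "/1" line and every "/2" line separately would give two FASTQ files in matching order, one per read.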
All times are GMT -8.