
**kmcarr** · 01-05-2011, 01:55 PM

Originally posted by sbberes

Dear community,
I just obtained some Illumina sequencing data, but it is in a format that I am unfamiliar with. The files were labeled "probs.txt". Does anybody recognize this format, and can anyone suggest software to parse it or convert it to FASTQ? From the header info it appears to be paired-end Illumina data with base calls, but I am not sure whether these are followed by quality values or intensities. The numbers following the base calls come in sets of 4, and the order of the numbers corresponds to ACGT: if the first of the four numbers is the highest then the base is an A, if the 2nd number of a set of 4 is the highest then the base is a C, etc.

Code:

3	1	10	1097#0/1	ATCTA........CCTGGCCACC.............	  37  -37  -40  -40	 -40   -9  -40    9	 -22   17  -25  -20	  -0  -40  -34    0	  -2   -2  -13   -6	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	 -13    6  -14  -10	 -24   12  -21  -13	 -39  -21  -19   17	 -40   -4    1   -7	 -15  -11    2   -3	 -13    8  -16  -12	 -11    8  -23  -12	  12  -14  -18  -20	 -13    6  -12  -10	 -15   10  -16  -13	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5
3	1	10	1097#0/2	CAGACATCGCGATCGGGTTCGCGATCCGC.CCGAAG	 -18   16  -33  -21	  39  -39  -40  -40	 -39   -8    6  -12	  34  -38  -37  -40	 -30   25  -27  -33	  31  -31  -40  -40	 -28   -8  -20    8	 -17   13  -19  -20	 -40  -26   17  -18	  -8    5  -16  -10	 -40  -10    8  -13	  21  -23  -40  -28	 -26  -13  -21   12	  -9    3  -19   -6	 -40  -22   16  -17	 -40   -4   -3   -3	 -40  -17   16  -26	 -30  -13  -16   11	 -40  -18  -36   18	  -4   -4   -5   -6	 -40  -24   15  -16	 -24   19  -24  -21	 -27  -11   10  -16	  21  -24  -32  -24	 -24  -15  -19   13	 -19   12  -19  -14	  -9    6  -13  -14	 -16   -8    3   -7	 -18    0   -8   -2	  -5   -5   -5   -5	  -9    1   -4  -12	 -14    3  -11   -6	 -16  -11    7  -10	  14  -14  -34  -22	  11  -15  -14  -28	 -14   -6   -1   -4

It appears that someone has shmushed together the old-style Illumina _seq.txt and _prb.txt files, as well as merging the read1 and read2 files for a paired-end run. Now that your sample is wrapped in CODE tags, it is much easier to see how the file is laid out.

- Column 1 identifies the lane, column 2 the tile, column 3 the X-coordinate and column 4 (up to the #) the Y-coordinate. Together they make up a unique identifier for the cluster.

- The "#0" is the multiplex ID field (meaningless in this case).

- The number after the / (1 or 2) indicates whether this is read1 or read2 for this particular cluster.

- Column 5 is the sequence.

- The following sets of numbers are quality scores for each base, arranged in groups of 4 which correspond to the probability of each base at the given position. As you correctly surmised they are in the order ACGT. Each group of 4 numbers is separated by a tab and within each group the numbers are separated by one or more spaces.
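The column layout above can be sketched as a small parser. This is only an illustration of the description, not part of any Illumina or MAQ tool; the names `Cluster`, `parse_line` and `top_bases` are made up for the example, and it assumes the fields are tab-separated exactly as in the sample.

```python
from typing import List, NamedTuple, Tuple


class Cluster(NamedTuple):
    """One line of the merged file (hypothetical container for this sketch)."""
    lane: str
    tile: str
    x: str
    y: str
    multiplex: str
    read: int
    sequence: str
    scores: List[Tuple[int, ...]]  # one (A, C, G, T) group per base


def parse_line(line: str) -> Cluster:
    """Split a merged seq/prb line into its identifier, sequence and scores."""
    fields = line.rstrip("\r\n").split("\t")
    lane, tile, x, ident, sequence = fields[:5]
    # Column 4 looks like "1097#0/1": Y-coordinate, multiplex ID, read number.
    y, _, rest = ident.partition("#")
    multiplex, _, read = rest.partition("/")
    # Each remaining tab-separated field holds four space-separated integers.
    scores = [tuple(int(v) for v in group.split()) for group in fields[5:]]
    return Cluster(lane, tile, x, y, multiplex, int(read), sequence, scores)


def top_bases(group: Tuple[int, ...]) -> str:
    """Return the base(s) in ACGT order that share the highest score."""
    best = max(group)
    return "".join(base for base, score in zip("ACGT", group) if score == best)
```

Note that the highest score does not always single out one base. In the read1 sample line the fourth group is "-0 -40 -34 0", a tie between A and T, while the sequence column reports T, so the sequence column should be treated as the authoritative call.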

I would recommend splitting this file back up into its original form, and also separating the records by read. Columns 1-5 (identifier and sequence) should go in a file named readN_seq.txt (where N is 1 or 2, depending on the read). All of the probability scores should go in files named readN_prb.txt. There should be one line in each file for each sequence, and you must maintain the proper order: there is no identifying information in the _prb.txt files, and the following step assumes they are matched line by line with the _seq.txt files.
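One way to do the split is a short script along these lines. It is a minimal sketch, assuming the merged file is tab-separated as in the sample; `split_merged` and its `prefix` argument are names invented for the example. Column 4 is written unchanged, including the "#0/1" suffix, and whether fq_all2std.pl accepts that suffix has not been checked here.

```python
def split_merged(path: str, prefix: str = "read") -> list:
    """Split a merged file into <prefix>N_seq.txt and <prefix>N_prb.txt.

    Lines are written in input order, so line i of each _prb.txt file
    matches line i of the corresponding _seq.txt file.
    Returns the sorted list of read numbers that were seen.
    """
    outputs = {}  # read number -> (seq file handle, prb file handle)
    try:
        with open(path) as merged:
            for line in merged:
                fields = line.rstrip("\r\n").split("\t")
                if len(fields) < 6:
                    continue  # skip blank or malformed lines
                # The read number is whatever follows the "/" in column 4.
                read = fields[3].rsplit("/", 1)[-1]
                if read not in outputs:
                    outputs[read] = (
                        open(f"{prefix}{read}_seq.txt", "w"),
                        open(f"{prefix}{read}_prb.txt", "w"),
                    )
                seq_out, prb_out = outputs[read]
                # Columns 1-5: identifier and sequence.
                seq_out.write("\t".join(fields[:5]) + "\n")
                # Everything else: the groups of four scores, tabs preserved.
                prb_out.write("\t".join(fields[5:]) + "\n")
    finally:
        for seq_out, prb_out in outputs.values():
            seq_out.close()
            prb_out.close()
    return sorted(outputs)
```

Calling `split_merged("probs.txt")` would then leave read1_seq.txt, read1_prb.txt, read2_seq.txt and read2_prb.txt in the current directory.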

Use the fq_all2std.pl script, which is provided as part of the MAQ package, to convert each pair of _seq.txt and _prb.txt files to FASTQ. The command to use for this input is seqprb2std. Sample command:

Code:

> fq_all2std.pl seqprb2std read1_seq.txt read1_prb.txt > read1.fastq

Do likewise for read2 and you should have two properly formatted FASTQ files.
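If MAQ is not at hand, the idea behind the conversion can be sketched directly. This is not the fq_all2std.pl code, just an illustration under stated assumptions: that the numbers are Solexa-scaled quality scores, that the quality of the called base is the highest score in its group of four, and that "." in the sequence should become "N". The function names are invented for the example. The Solexa-to-Phred mapping used is Q_phred = 10 * log10(1 + 10^(Q_solexa / 10)).

```python
import math


def solexa_to_phred(q: int) -> int:
    """Convert a Solexa-scaled quality score to the nearest Phred score."""
    return int(10 * math.log10(1 + 10 ** (q / 10.0)) + 0.5)


def fastq_record(name: str, sequence: str, groups) -> str:
    """Build one FASTQ record with Phred+33 (Sanger) encoded qualities.

    `groups` holds one (A, C, G, T) score tuple per base. The highest
    score in each group is taken as that base's quality (an assumption).
    """
    quals = "".join(chr(33 + solexa_to_phred(max(g))) for g in groups)
    # Uncalled bases are written as "." in the source; FASTQ tools
    # generally expect "N" instead (assumption for this sketch).
    return f"@{name}\n{sequence.replace('.', 'N')}\n+\n{quals}\n"
```

The result is worth comparing against the MAQ output on a few reads before relying on it, since the two may differ in rounding or in how uncalled bases are handled.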

Anybody recognize this Illumina data format?
