SEQanswers

Old 05-14-2008, 04:27 PM   #1
rebrendi
ng
 
Location: LA

Join Date: May 2008
Posts: 78
Default Output file formats

Hello,

Could you please help me sort out the output file formats?
What does each column mean, and which program (under Windows) can I use to read these files?

Thank you very much!

There are three output files.
The first file has a name like s_2_0001_seq.txt and looks like this:

2 1 912 885 TGGCAAGGAAAATAAAATCAAAAA
2 1 901 884 TGGTACATATACACCATGAAATAT
2 1 897 115 TGAAGGACCAGAGTGCCTGGACTT
2 1 933 879 AAGGCAACAAAAAGAGACTCCATA
2 1 888 104 TGGGACACATTTAAAGCAACGAGA
2 1 920 116 AATCCAGAAGTGGGGGCCTGTGCA
2 1 920 894 TCAAAACTGAAACACTTCCCATCA
2 1 900 896 TGTCATCCTGAAGTGCAGTGGATA
2 1 896 921 TTAGGAAAAAACAAAAAACAAAAA
2 1 886 105 AGGGAAAATGGAAAAATAACAAAC
2 1 876 955 TACCAAACATTTGAGGCAGAAATG


The second file is named like s_2_0001_sig2.txt and looks like this:


2 1 912 885 2925.5 5978.4 913.5 6583.8 1032.0 1638.7 4854.6 221.5 109.3 1888.1 2405.1 2398.7 1271.8 4134.4 737.4 -222.2 2190.1 1825.4 679.2 -2.1 3953.1 277.9 391.4 334.6 1393.2 972.6 3596.5 1032.0 391.4 53.5 2777.0 -167.5 2737.7 277.9 913.5 109.3 2584.9 2398.7 334.6 109.3 2628.7 972.6 448.5 -57.4 2157.6 -705.0 165.3 1700.7 277.9 1091.6 448.5 1819.4 3357.7 -2.1 -2.1 1762.9 2325.5 277.9 1271.8 -222.2 1105.7 1032.0 109.3 109.3 2815.2 -112.5 391.4 1211.5 277.9 334.6 1151.4 1263.3 913.5 1464.4 505.9 165.3 2341.2 505.9 165.3 165.3 2221.7 -492.5 221.5 -112.5 1853.9 505.9 109.3 1393.2 492.8 165.3 448.5 165.3 1906.0 53.5 563.4 563.4

And the third file is named like s_2_0001_prb.txt, and looks like this:

-40 -5 -40 5 -40 -40 40 -40 -40 -40 1 -1 -40 40 -40 -40 13 -13 -40 -40 40 -40 -40 -40 -40 -40 40 -40 -40 -40 40 -40 40 -40 -40 -40 6 -6 -40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 40 -40 -40 -40 2 -2 -40 -40 40 -40 -40 -40 -40 -40 -19 19 -19 19 -40 -40 40 -40 -40 -40 40 -40 -40 -40 18 -31 -40 -18 0 -10 -2 -22 40 -40 -40 -40
-40 -40 -40 40 -40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 -40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 -40 -40 40
-40 -40 -40 40 -40 -40 40 -40 40 -40 -40 -40 40

Last edited by rebrendi; 05-14-2008 at 04:42 PM.
Old 05-14-2008, 05:20 PM   #2
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

Hey Rebrendi, I fully expect someone more knowledgeable than myself will chime in soon, but I do know that the PRB file is the per base quality file. There is more info here:

Quote:
For the latter one, four numbers per base are listed to present the negative log-transform of the probabilities of four nucleotides (A, C, G, T) to be sequenced at this base position.
...from http://rulai.cshl.edu/rmap/

So it's possible to use the PRB as the sequence as well.

Hopefully that will tide you over until said smarter person appears!
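
A minimal sketch of "using the PRB as the sequence", not taken from any Illumina tool: for each cycle, pick the base with the highest of the four scores. It assumes four integer scores per cycle in A, C, G, T order, as in the quoted description. The function name call_bases_from_prb is made up for illustration.

```python
BASES = "ACGT"  # assumed column order within each cycle


def call_bases_from_prb(line):
    """Return (sequence, scores) for one line of a _prb.txt file."""
    # Split on any whitespace so the result does not depend on whether
    # cycles are separated by tabs or blanks.
    values = [int(v) for v in line.split()]
    if len(values) % 4 != 0:
        raise ValueError("expected four scores per cycle")
    sequence = []
    scores = []
    for i in range(0, len(values), 4):
        cycle = values[i:i + 4]
        # Index of the highest score; ties go to the first base in A, C, G, T order.
        best = max(range(4), key=lambda j: cycle[j])
        sequence.append(BASES[best])
        scores.append(cycle[best])
    return "".join(sequence), scores
```

Fed the first four cycles of the prb sample in post #1, this returns TGGC, which matches the start of the first read in the seq sample. Where two bases tie, the result may differ from what the seq file reports.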
Old 05-14-2008, 06:24 PM   #3
rebrendi
ng
 
Location: LA

Join Date: May 2008
Posts: 78
Default

Thank you very much, ECO!

So, I still need answers about the other two file formats, and about a program to read them under Windows (if one exists). I wonder, is it possible to extract the nucleotide positions in the genome from these short raw data? And there are also other tags encoded somewhere in these files.
Old 05-14-2008, 11:47 PM   #4
cgb
Member
 
Location: Cambridge

Join Date: May 2008
Posts: 50
Default

The sig2 files are processed "traces"; you can draw a bar chart with them for each sequence. The seq files are the final data. It's trivial to convert the seq and prb files into a fastq file, and there are tools floating around to do this.

Generally the key is the first 4 columns: lane, tile, x, y for the given cluster that gave the sequence.
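
As a rough illustration of that key, here is a small Python sketch that splits one line of a _seq.txt file into its five whitespace-separated fields. The column order (lane, tile, x, y, sequence) is taken from the post above; the SeqRecord type and parse_seq_line name are hypothetical.

```python
from typing import NamedTuple


class SeqRecord(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int
    sequence: str


def parse_seq_line(line):
    """Split one _seq.txt line into its key (lane, tile, x, y) and the read."""
    lane, tile, x, y, sequence = line.split()
    return SeqRecord(int(lane), int(tile), int(x), int(y), sequence)


record = parse_seq_line("2 1 912 885 TGGCAAGGAAAATAAAATCAAAAA")
print(record.lane, record.tile, record.x, record.y, record.sequence)
```

The first four fields together identify the cluster, so they can serve as a join key between the seq and sig2 files.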
Old 05-14-2008, 11:49 PM   #5
rebrendi
ng
 
Location: LA

Join Date: May 2008
Posts: 78
Default

Quote:
Originally Posted by cgb View Post
Generally the key is the first 4 columns: lane, tile, x, y for the given cluster that gave the sequence.
Well, so what do these columns mean?

Last edited by rebrendi; 05-14-2008 at 11:52 PM.
Old 05-14-2008, 11:54 PM   #6
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

Lane = 1-8 (which channel of the flowcell)

X,Y = physical location of the cluster on the flowcell...
Old 05-15-2008, 12:02 AM   #7
rebrendi
ng
 
Location: LA

Join Date: May 2008
Posts: 78
Default

Quote:
Originally Posted by ECO View Post
Lane = 1-8 (which channel of the flowcell)

X,Y = physical location of the cluster on the flowcell...
thanks....
Old 05-15-2008, 01:50 AM   #8
cgb
Member
 
Location: Cambridge

Join Date: May 2008
Posts: 50
Default

Not quite....

The flowcell has 8 lanes, and the lane number identifies the lane. Each lane has up to 330 'tiles', which are numbered in a snaking pattern. X,Y is the cluster coordinate on the given tile.
Old 05-15-2008, 02:02 AM   #9
cgb
Member
 
Location: Cambridge

Join Date: May 2008
Posts: 50
Default

... On the sig2 files: each row (= cluster) starts with the same key in the first 4 columns. Then you have 4 values for A,C,G,T <Tab> A,C,G,T etc., up to the cycle number.

Note: your quality values are raw Qscores emitted by Bustard and will not be well calibrated.
Old 05-15-2008, 07:54 AM   #10
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Quote:
Originally Posted by cgb View Post
The sig2 files are processed "traces"; you can draw a bar chart with them for each sequence. The seq files are the final data. It's trivial to convert the seq and prb files into a fastq file, and there are tools floating around to do this.

Generally the key is the first 4 columns: lane, tile, x, y for the given cluster that gave the sequence.
cgb,
can you say more on these programs that convert prb + seq into fastq format?
There is also a _sequence.txt output per lane, which contains the reads in the seq files minus the reads that fail the chastity filter. This can then be converted to fastq using one of the MAQ utilities.

Any advantage of using seq + prb, instead of the filtered _sequence? I have heard from MAQ, SSAHA and other authors that using the filtered file is preferred to get better alignment results using their tools

sm
Old 05-15-2008, 12:02 PM   #11
cgb
Member
 
Location: Cambridge

Join Date: May 2008
Posts: 50
Default

Have a look on the Sanger site; if that doesn't help, mail jkb@sanger.ac.uk or ts6@sanger.ac.uk
Old 05-15-2008, 02:50 PM   #12
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by bioinfosm View Post
cgb,
There is also a _sequence.txt output per lane, which contains the reads in the seq files minus the reads that fail the chastity filter. This can then be converted to fastq using one of the MAQ utilities.
On our pipeline, the _sequence.txt file only has 32 bases of sequence. If you are using SOAP or Maq, or you are doing more than 36 bases, you don't want to lose all those bases. Maybe you can fool around with the pipeline to get it to output more, but I don't know how. It also uses a non-standard quality scoring format, but that's not a deal-breaker.

I made a <50 line perl thingie to take the .prb and .seq files to make a fastq. If I can do it, it can't be that hard
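
The Perl script itself was not posted. As a stand-in, here is a hedged Python sketch of the same idea, under these assumptions: line N of the seq file and line N of the prb file describe the same cluster, each prb line has four scores per base, and the quality of a call is the highest of its four scores. The read name format and the default ASCII offset of 64 are illustrative choices only; the scores are written unchanged, so a tool that expects phred-scaled qualities would need them converted first.

```python
def fastq_records(seq_lines, prb_lines, offset=64):
    """Yield FASTQ-style records built from paired seq and prb lines."""
    for seq_line, prb_line in zip(seq_lines, prb_lines):
        lane, tile, x, y, bases = seq_line.split()
        scores = [int(s) for s in prb_line.split()]
        if len(scores) != 4 * len(bases):
            raise ValueError("prb line does not have four scores per base")
        # Quality of each call = best of the four scores in that cycle.
        quals = [max(scores[i:i + 4]) for i in range(0, len(scores), 4)]
        qual_str = "".join(chr(q + offset) for q in quals)
        name = f"{lane}_{tile}_{x}_{y}"  # hypothetical naming scheme
        yield f"@{name}\n{bases}\n+\n{qual_str}\n"
```

To run it on real files, open both and pass the file objects, for example fastq_records(open("s_2_0001_seq.txt"), open("s_2_0001_prb.txt")), and write each yielded record to the output file.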
Old 05-15-2008, 03:25 PM   #13
ScottC
Senior Member
 
Location: Monash University, Melbourne, Australia.

Join Date: Jan 2008
Posts: 246
Default

Quote:
Originally Posted by cgb View Post
Note - your quality values are raw Qscores emitted by Bustard and will not be well calibrated.

Hi cgb,

Can you expand on this a bit more please?

Cheers,

Scott.
Old 05-16-2008, 01:42 AM   #14
cgb
Member
 
Location: Cambridge

Join Date: May 2008
Posts: 50
Default

The scores are supposed to reflect the chance of a basecall being in error: 20 = 1 in 100, etc. If they do this accurately they are "calibrated". Raw Bustard scores are not well calibrated: Bustard tends to both over-score and under-score bases, and shoves a lot of them into a Q40 bin (wrongly). The scores can be adjusted after the fact using several well-known methods. The newer (0.4) / 1.0 release of the GAPipeline allows for some degree of recalibration using control lane data.
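
To make the "20 = 1 in 100" relationship concrete, here is a small Python sketch using the usual phred-style definition Q = -10 * log10(E). The function name is made up for illustration.

```python
def phred_error_probability(q):
    """Probability that a base call with phred-style score q is wrong."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    one_in = round(1 / phred_error_probability(q))
    print(f"Q{q}: about 1 error in {one_in} calls")
```

A well calibrated Q40 bin should therefore contain about one wrong call per 10,000 bases.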
Old 05-16-2008, 04:27 AM   #15
SillyPoint
Member
 
Location: Frederick MD, USA

Join Date: May 2008
Posts: 39
Default

To amplify a bit on cgb's posting: If you align your reads to a known, error-free reference (e.g., PhiX), you can then count the true errors and establish a true error rate. Compare this to the estimated error rate embodied in the Q scores. They should match: Out of all the Q30 bases in all the reads, there should be 1 error in 1000, and so on for each Q value.

An easy place to find this information is in the s_<lane>_qreport.txt file produced by Gerald when you do an alignment on the lane (ANALYSIS default or Eland). What you'll see there is that what's called Q40 really has 0.5% errors = Q23.
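
The comparison described above can be sketched in a few lines of Python. This is not how Gerald computes its report; it only assumes you have already counted, for each reported Q value, how many aligned bases carried that score and how many of them mismatched the reference. Both function names are hypothetical.

```python
import math


def empirical_q(n_bases, n_errors):
    """Phred-style score implied by an observed error count."""
    if n_errors == 0:
        return float("inf")  # no errors observed; the score is unbounded
    return -10 * math.log10(n_errors / n_bases)


def calibration_table(counts):
    """Map reported Q -> empirical Q, given {reported_q: (n_bases, n_errors)}."""
    return {q: empirical_q(n, e) for q, (n, e) in sorted(counts.items())}
```

For example, 5 mismatches per 1000 bases (the 0.5% quoted above) gives empirical_q(1000, 5) of roughly 23, however high the reported score was.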
Old 05-16-2008, 04:52 AM   #16
SillyPoint
Member
 
Location: Frederick MD, USA

Join Date: May 2008
Posts: 39
Default

Further to the OP's original question:

The _seq.txt file is just as cgb says: lane, tile, X, Y, sequence. X & Y are in pixels relative to the upper left corner of each tile image, with +X to the right, and +Y down (don't ask).

The _sig2.txt file also starts with lane, tile, X, Y. The rest is intensities for each base, each cycle. Intensities have been corrected for crosstalk and phasing. Pay attention here: For each cycle, there are four values (a,c,g,t). They are separated by *blanks*. Cycles (4 values) are in turn separated by *tabs*.
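
A minimal Python sketch of reading one such line, assuming the layout just described. It splits on any whitespace and regroups the intensities in fours, so it does not depend on exactly where the tabs fall. The function name parse_sig2_line is made up.

```python
def parse_sig2_line(line):
    """Return ((lane, tile, x, y), cycles) for one _sig2.txt line.

    cycles is a list with one dict per cycle, mapping base -> intensity.
    """
    fields = line.split()
    lane, tile, x, y = (int(f) for f in fields[:4])
    values = [float(f) for f in fields[4:]]
    if len(values) % 4 != 0:
        raise ValueError("expected four intensities per cycle")
    cycles = [
        dict(zip("ACGT", values[i:i + 4]))  # assumed a, c, g, t order
        for i in range(0, len(values), 4)
    ]
    return (lane, tile, x, y), cycles
```

With the per-cycle dicts in hand, the bar chart mentioned earlier in the thread is one plot call per read, and max(cycle, key=cycle.get) gives the brightest base for a cycle.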

The _prb.txt file contains base probabilities arranged the same way. No lane/tile/x/y here, though. The probabilities are given Solexa-style: Q = 10 * log10 (P/(1-P)), where P is the probability that the base is a/c/g/t. Not to be confused with phred-style scores, encoded as Q = -10 * log10 (E), where E is the probability of an *incorrect* call.
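
The two definitions can be related by setting E = 1 - P, which gives Q_phred = 10 * log10(1 + 10^(Q_solexa / 10)). A short Python sketch of both directions follows; the function names are illustrative and not taken from any pipeline.

```python
import math


def solexa_to_prob(q_solexa):
    """Probability P that the base is correct, from Q = 10*log10(P/(1-P))."""
    return 1 / (1 + 10 ** (-q_solexa / 10))


def solexa_to_phred(q_solexa):
    """Phred-style score -10*log10(1-P) for the same probability P."""
    return 10 * math.log10(1 + 10 ** (q_solexa / 10))


def phred_to_solexa(q_phred):
    """Inverse conversion; only defined for q_phred > 0."""
    return 10 * math.log10(10 ** (q_phred / 10) - 1)
```

The two scales nearly coincide for high scores (Solexa 40 is phred 40.0004) and diverge at the low end: Solexa 0 means P = 0.5, which is phred 3.01, and Solexa scores can go negative while phred scores cannot.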

Having said all that, I'm moved to enquire: Why are you looking at what are really intermediate data files? The end product of the pipeline for most purposes is the _sequence.txt files produced by the Gerald step. There you will find what amounts to fastq-format files, containing sequence and base scores, plus lane/tile/X/Y. Only beware that the scores are Solexa-style and encoded as ascii by adding 64 (so Q40='h'). maq expects a true fastq file, with phred-style scores plus 33 (Q40='I').
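
For the string-level step, here is a hedged Python sketch that rewrites a quality string from Solexa scores plus 64 to phred scores plus 33. It is not the maq utility, just the arithmetic described above applied per character, with rounding to the nearest integer as an assumption.

```python
import math


def solexa64_to_phred33(quality):
    """Re-encode a Solexa+64 quality string as phred+33."""
    out = []
    for ch in quality:
        q_solexa = ord(ch) - 64
        # Same probability expressed on the phred scale, rounded to an integer.
        q_phred = round(10 * math.log10(1 + 10 ** (q_solexa / 10)))
        out.append(chr(q_phred + 33))
    return "".join(out)


print(solexa64_to_phred33("h"))  # Solexa Q40 'h' becomes phred Q40 'I'
```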
Old 02-03-2009, 04:32 AM   #17
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44
Default

Sorry for this basic question. But what is the tile? Is it the pictures of the lane? So, each little photographed square in a given lane is called a tile (tile 1, tile 2 ,etc etc)?
ines
Old 02-03-2009, 05:14 AM   #18
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169
Default

Quote:
Originally Posted by inesdesantiago View Post
Sorry for this basic question. But what is the tile? Is it the pictures of the lane? So, each little photographed square in a given lane is called a tile (tile 1, tile 2 ,etc etc)?
ines
Yes, that is correct. Tiles do not exist in any physical sense, they are just the sections of each lane as they are imaged by the camera. On the current generation instrument, GAII, there are 100 tiles per lane, made up of two columns of 50 tiles each. The tiles are numbered starting with #1 at the top left of a lane, down to #50 at the bottom left, over to #51 at the bottom right then up to #100 at the top right.
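
The numbering just described can be turned into a position within the lane with a few lines of Python. This sketch assumes exactly the GAII layout above (two columns of 50 tiles, down the left and back up the right); the function name and the (column, row) convention with row 0 at the top are made up for illustration.

```python
def gaii_tile_position(tile, tiles_per_column=50):
    """Return (column, row) for a tile number; column 0 is left, row 0 is top."""
    if not 1 <= tile <= 2 * tiles_per_column:
        raise ValueError("tile number out of range")
    if tile <= tiles_per_column:
        # Left column: numbered from the top down.
        return 0, tile - 1
    # Right column: numbered from the bottom up.
    return 1, 2 * tiles_per_column - tile
```

So tile 1 and tile 100 sit side by side at the top of the lane, and tiles 50 and 51 sit side by side at the bottom.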
Old 02-03-2009, 05:40 AM   #19
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44
Default

Good to know! Thanks for the reply!
Ines