SEQanswers

Old 04-15-2009, 07:49 PM   #1
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274
Solution to Sanger/Solexa/Illumina FASTQ confusion

Many of the posts in this forum concern confusion over the FASTQ format and its variations in quality values and ASCII encodings.

To help solve this once and for all, I have written a first draft Wikipedia page for the FASTQ format.

http://en.wikipedia.org/wiki/FASTQ_format

I hope that knowledgeable members of this forum can help me improve the page and correct any errors!

Thank you

Torst
Old 04-16-2009, 04:50 AM   #2
BAJ
Member
 
Location: Paris

Join Date: Nov 2008
Posts: 15

Maybe you can also describe the various header lines and what they mean.
Illumina gives something like this:
@HWI-EAS285:1:1:1582:1499#0/1
Swift outputs:
@L1-100:474:2

Unfortunately I don't know what the numbers mean. The "@HWI-EAS285" and "@L1" are user-specified names.
In the Illumina header, the following ":1" refers to the lane, and the next number to the tile (I believe).
I am inclined to believe the remaining numbers are the x/y coordinates of the registered images, but I don't know for sure.

Thx, Bernd
Old 04-16-2009, 05:16 AM   #3
dcjamison
Member
 
Location: Cincinnati

Join Date: Oct 2008
Posts: 15

Very nice. One minor issue:

"Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99."

The "99" should be 104, or else the range is only 0 to 35.
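To make the arithmetic concrete, here is a minimal Python sketch. It assumes only what is quoted above (scores 0 to 40 stored at an ASCII offset of 64); the function names are made up for illustration.

```python
# Illustrative sketch only: assumes the Illumina 1.3 offset of 64 quoted above.
ILLUMINA_13_OFFSET = 64

def encode_quality(score, offset=ILLUMINA_13_OFFSET):
    """Return the ASCII character that stores a quality score."""
    return chr(score + offset)

def decode_quality(char, offset=ILLUMINA_13_OFFSET):
    """Return the quality score stored in an ASCII character."""
    return ord(char) - offset

lowest = ord(encode_quality(0))      # ASCII code used for score 0
highest = ord(encode_quality(40))    # ASCII code used for score 40
print(lowest, highest)               # prints "64 104"
print(decode_quality(chr(99)))       # prints "35": ASCII 99 only reaches score 35
```

So a top score of 40 needs ASCII 104 (the character 'h'), which is why 99 had to be a typo.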

Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.

Curt

Last edited by dcjamison; 04-16-2009 at 05:27 AM. Reason: added the minor quibble
Old 04-16-2009, 05:01 PM   #4
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274

Quote:
Originally Posted by BAJ View Post
maybe you can also describe the various header lines and what they mean... Illumina gives something like this:
@HWI-EAS285:1:1:1582:1499#0/1
I am not 100% sure of the fields, and my colleague has contacted Illumina for clarification, but what I do know I have added to the Wiki page:

http://en.wikipedia.org/wiki/FASTQ_format

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell
941 'x'-coordinate of the cluster within the tile
1973 'y'-coordinate of the cluster within the tile
#0 unknown
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
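The field list above can be turned into a small parser. This is only a sketch: the regular expression is an assumption built from the single example header, not an official Illumina specification, and the function name is hypothetical.

```python
import re

# Field names follow the list above. The pattern is an assumption based on
# one example header; other pipeline versions may format headers differently.
HEADER_PATTERN = re.compile(
    r"^@(?P<instrument>[^:]+)"   # unique instrument name
    r":(?P<lane>\d+)"            # flowcell lane
    r":(?P<tile>\d+)"            # tile number within the flowcell
    r":(?P<x>\d+)"               # x-coordinate of the cluster
    r":(?P<y>\d+)"               # y-coordinate of the cluster
    r"(?:#(?P<index>[^/]+))?"    # optional field after '#', meaning unknown
    r"(?:/(?P<pair>[12]))?$"     # optional member of a pair
)

def parse_illumina_header(line):
    """Return a dict of header fields, or None if the line does not match."""
    match = HEADER_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    return fields

fields = parse_illumina_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields)
```

A header in a different layout, such as the Swift one quoted earlier, does not match this pattern and returns None instead of being mis-parsed.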
Old 04-16-2009, 05:05 PM   #5
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274

Curt,

Quote:
Originally Posted by dcjamison View Post
Very nice. One minor issue:
"Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99." The "99" should be 104, or else the range is only 0 to 35.
Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.
I have fixed the 99/104 typo, thank you for replying!

The 1.3 Pipeline user manual says it uses pure Phred scores, -10*log10(e), but it does NOT clarify how it maps them to ASCII. As these cannot be negative, I am somewhat confused.
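For comparison, here is a small Python sketch of the two definitions. The Phred formula is the one quoted from the manual; the Solexa-style formula, -10*log10(e/(1-e)), is the commonly cited log-odds definition and is an assumption here, since the thread does not state it. The function names are illustrative.

```python
import math

def phred_score(error_probability):
    """Phred score as quoted from the manual: -10*log10(e)."""
    return -10 * math.log10(error_probability)

def solexa_score(error_probability):
    """Solexa-style log-odds score (assumed definition): -10*log10(e/(1-e))."""
    return -10 * math.log10(error_probability / (1 - error_probability))

# Compare the two scores across a range of error probabilities.
for e in (0.75, 0.5, 0.1, 0.001):
    print(e, round(phred_score(e), 2), round(solexa_score(e), 2))
```

For any error probability between 0 and 1 the Phred score stays positive, while the log-odds score drops below zero once e exceeds 0.5, which is where a negative range such as -5 to 0 can come from.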
Old 04-17-2009, 01:03 AM   #6
chris
Member
 
Location: Dundee, Scotland

Join Date: Apr 2008
Posts: 52

That's a useful page, thanks for setting it up.

Regarding the Phred -> Solexa quality scores, I think it's worth mentioning this paper:
http://nar.oxfordjournals.org/cgi/co...act/36/16/e105

They show (in Table 3) that the Solexa error rates are not comparable to Phred at the same score, e.g. Phred has an error rate of 0.01% at score 40, but Solexa has a calculated error of 0.43% at score 40.

Overall, Solexa is overly optimistic at high quality scores and overly pessimistic at low quality scores.
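To put those two percentages on the same scale, here is a short Python sketch that converts between a Phred score and an error rate. It uses only the standard Phred relationship and the 0.43% figure quoted above; the function names are illustrative.

```python
import math

def error_rate_from_phred(score):
    """Error probability implied by a Phred score: 10^(-Q/10)."""
    return 10 ** (-score / 10)

def phred_from_error_rate(error_rate):
    """Phred score implied by an error probability: -10*log10(e)."""
    return -10 * math.log10(error_rate)

nominal = error_rate_from_phred(40)         # what Q40 is supposed to mean
effective = phred_from_error_rate(0.0043)   # what a 0.43% error rate amounts to
print(f"Q40 nominal error rate: {nominal:.4%}")
print(f"0.43% error corresponds to about Q{effective:.1f}")
```

In other words, a reported score of 40 with a 0.43% observed error rate behaves like a Phred score of roughly 23.7.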
Old 04-17-2009, 01:45 AM   #7
clivey
Member
 
Location: Oxford

Join Date: Jul 2008
Posts: 24

You simply need to 'recalibrate' the scores so that Q40 means Q40, etc. Some software tools are available to do this, and it is not hard to write something.
Old 04-17-2009, 02:17 AM   #8
chris
Member
 
Location: Dundee, Scotland

Join Date: Apr 2008
Posts: 52

I don't think it matters that Q40 != Q40, just as long as people are aware of the fact, which I didn't think was the case in this thread.
Old 04-17-2009, 07:17 AM   #9
dlepp
Junior Member
 
Location: Canada

Join Date: Mar 2009
Posts: 5

Quote:
Originally Posted by clivey View Post
you simply need to 'recalibrate' the score so that Q40 means Q40 etc. some software tools are available to do this and it is not hard to write something.
I wonder if you could explain the recalibration and point towards some tools?
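One general way to think about recalibration, sketched in Python below: count how often bases with each reported score turn out to be wrong (for example, mismatches against a trusted reference), then replace the reported score with the Phred score of that observed error rate. This is an assumption about the general idea, not a description of any particular tool; the function name, the smoothing choice, and the counts are all made up for illustration.

```python
import math

def build_recalibration_table(counts):
    """Map each reported score to an empirical Phred score.

    counts maps a reported score to (mismatched_bases, total_bases).
    """
    table = {}
    for reported, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the rate above zero when no mismatches were seen.
        rate = (mismatches + 1) / (total + 2)
        table[reported] = round(-10 * math.log10(rate))
    return table

# Made-up counts purely for illustration, not real data.
example_counts = {40: (4300, 1000000), 20: (10000, 1000000)}
table = build_recalibration_table(example_counts)
print(table)
```

With these invented counts, a reported 40 would be rewritten to 24 while a reported 20 stays at 20. A real recalibration would also need to handle bases that mismatch because of genuine variation rather than sequencing error.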

Thanks.
Old 07-22-2009, 05:35 AM   #10
ohlsson
Junior Member
 
Location: Uppsala, Sweden

Join Date: Jun 2009
Posts: 4

Great job, Torst! I have been struggling to get a grip on those Illumina FASTQ headers for a month now, but somehow I missed your wiki page.
I'm still not clear on one point though. I have a heap of data from a multiplexed run on an Illumina GA2. The read headers largely fit your description, but what puzzles me is the index part:
@HWI-EAS178:1:1:2:1349#TGGCAT/1
As you can see, instead of an index number I have a short nucleotide sequence, which I suppose is meant to be the multiplex index sequence. As a rule, these 6-mer tags do not appear in the read sequence that follows. Do you think that they represent the multiplex index tags?
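A quick way to inspect that part of the header is sketched below in Python. It assumes the index sits between '#' and '/' as in the example above; the function names are hypothetical.

```python
def split_index_field(header):
    """Return (index, pair) from a header such as '@name#TGGCAT/1'."""
    _, _, tail = header.strip().partition("#")
    index, _, pair = tail.partition("/")
    return index, pair

def looks_like_barcode(index):
    """True when the index field consists only of nucleotide letters."""
    return bool(index) and set(index.upper()) <= set("ACGTN")

index, pair = split_index_field("@HWI-EAS178:1:1:2:1349#TGGCAT/1")
print(index, pair, looks_like_barcode(index))   # prints "TGGCAT 1 True"
print(looks_like_barcode("0"))                  # prints "False"
```

This distinguishes headers carrying a nucleotide tag from those carrying a plain number such as #0.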

Many thanks for any suggestions!
/Ingemar
Old 08-04-2009, 02:19 AM   #11
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274

ohlsson,

The nucleotide sequence instead of the number must be new for GAPipeline 1.4. We are about to finish a multiplex run, so I will check what our files look like and let you know. But I suspect you are right and that it is the barcode for the multiplex. I think they are usually 6 or 7 base pairs long.
Old 08-04-2009, 02:34 AM   #12
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 135

They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).

I'm not really sure what to make of this notation though. They don't seem entirely consistent between file formats either. I've seen other files that had #0/1, implying it's a number and not a string.
Old 08-04-2009, 06:21 PM   #13
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274

Quote:
Originally Posted by jkbonfield View Post
They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).
From the manual:

The split_on_index.py script identifies all read index sequences that are identical to the reference index sequences, or that differ by a user-defined number of bases. It then breaks up the rows of the export.txt or sorted.txt file and places each row into a separate file, one for each sample.

In order for this process to work, you need the following:

* All samples in a lane are aligned to the same target sequences. The output will be stored in the GERALD directory in export.txt and sorted.txt files.

* A sample sheet, which is an xml configuration file entered during cluster generation. The sample sheet associates index sequences with sample IDs

Sounds like the right tool for the job?
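The matching rule described in the manual excerpt (identical, or differing by a user-defined number of bases) can be illustrated with a plain Hamming-distance comparison. The Python sketch below is a generic illustration, not the code of split_on_index.py; the function names, the sample sheet contents, and the choice to reject ambiguous matches are assumptions.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read_index, sample_sheet, max_mismatches=1):
    """Return the sample ID whose index is within max_mismatches, else None.

    Reads that match more than one reference index are left unassigned.
    """
    hits = [
        sample_id
        for reference_index, sample_id in sample_sheet.items()
        if len(reference_index) == len(read_index)
        and hamming_distance(read_index, reference_index) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None

# Hypothetical sample sheet: index sequence -> sample ID.
sample_sheet = {"TGGCAT": "sample_A", "ACGTAC": "sample_B"}
print(assign_sample("TGGCAT", sample_sheet))   # exact match
print(assign_sample("TGGCAA", sample_sheet))   # one mismatch
print(assign_sample("GGGGGG", sample_sheet))   # too far from every index
```

Setting max_mismatches to 0 reduces this to exact matching.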
Old 08-04-2009, 11:23 PM   #14
ohlsson
Junior Member
 
Location: Uppsala, Sweden

Join Date: Jun 2009
Posts: 4

Ah, interesting! I will try to find that python script and see how it works.

I already coded a pretty simple Perl script that separates reads by exact matching of the header tag to a list of barcodes. It seems to work pretty well: for a mixture of four indexed samples, roughly one fifth of the mixture was sorted to each of the four used barcodes, and one fifth was left unsorted (due to mismatches, so yes jkbonfield, I also think that the tag in the header is sequenced DNA).
Interestingly, each of the eight unused barcodes got only a few hits, in the region of 1-20 reads (out of ~20 million), so the number of false positives was very low.
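For anyone wanting to try the same exact-match approach, here is a rough Python equivalent of the idea. It is not the Perl script mentioned above; the function names, the barcode list, and the two example records are invented for illustration, and it assumes plain four-line FASTQ records.

```python
from collections import Counter

def read_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    # Grouping one iterator four times walks the input four lines at a time.
    for header, sequence, _, quality in zip(*[iter(lines)] * 4):
        yield header.strip(), sequence.strip(), quality.strip()

def count_by_barcode(records, barcodes):
    """Count reads per barcode using exact matching of the header tag."""
    counts = Counter()
    for header, _, _ in records:
        tag = header.partition("#")[2].partition("/")[0]
        counts[tag if tag in barcodes else "unsorted"] += 1
    return counts

# Invented records and barcodes, for illustration only.
example_lines = [
    "@HWI-EAS178:1:1:2:1349#TGGCAT/1", "ACGT", "+", "hhhh",
    "@HWI-EAS178:1:1:2:1350#TGGCAA/1", "TTGA", "+", "hhhh",
]
barcodes = {"TGGCAT", "ACGTAC"}
counts = count_by_barcode(read_fastq(example_lines), barcodes)
print(counts)
```

Listing unused barcodes in the barcode set as well gives a rough check on false positives, as described above.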