SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Can the upcoming Sandy Bridge i7 Extreme assemble a genome? ymc Bioinformatics 30 06-06-2012 07:38 AM
help. Casava 1.8 demultiplexing senpeng Illumina/Solexa 1 09-19-2011 08:40 AM
CASAVA v1.8 with indels tonio100680 Bioinformatics 3 08-19-2011 05:53 AM
Demultiplexing and CASAVA 1.7 tonio100680 Bioinformatics 14 06-16-2011 11:48 PM
Upcoming in 2009? dsturgill Events / Conferences 1 11-07-2008 02:41 AM

Reply
 
Thread Tools
Old 06-28-2011, 12:53 PM   #61
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,088
Default

Quote:
Originally Posted by skruglyak View Post
Hi Natalie,

there have been some improvements to the chemistry and a refinement of the quality model. As a result, we are now starting to see Q41. There is no additional meaning behind the "J".

Thanks,

Semyon
Hi Semyon,

What is the new numeric range for scores for the v.3. chemistry? Should this be considered Phred+45 mapping now?

Thanks in advance
GenoMax is offline   Reply With Quote
Old 06-28-2011, 02:18 PM   #62
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default

Quote:
Originally Posted by GenoMax View Post
Hi Semyon,

What is the new numeric range for scores for the v.3. chemistry? Should this be considered Phred+45 mapping now?

Thanks in advance

HI,

if you are asking about the offset, we are Phred + 33. We recently made the change from +64 to +33 to match the original Sanger convention and we will not move away from this.

Our latest quality calibration data showed an upper bound of 41. We hope to push this higher in the future.

Thanks,
Semyon
skruglyak is offline   Reply With Quote
Old 07-18-2011, 07:57 AM   #63
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 509
Default

Hi Semyon,

It doesn't appear that the offsets have changed, at least in our copy of CASAVA v1.8. I just checked the export.txt files from our last run, and the Q-scores range as high as 'i' (which, with +33 offset, would be 72!). Is there a flag we should be using to generate Phred+33?

Thanks,
Harold

P.S.-I just discovered that the bam files contain Phred+33 Q-scores, while the exports contain Phred+64. The good news: our existing pipeline takes export format as input, so I don't need to change anything. The bad news: I already changed it in anticipation of Phred+33...

Last edited by HESmith; 07-18-2011 at 08:34 AM. Reason: new info
HESmith is offline   Reply With Quote
Old 07-18-2011, 09:39 AM   #64
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default

Quote:
Originally Posted by HESmith View Post
Hi Semyon,

It doesn't appear that the offsets have changed, at least in our copy of CASAVA v1.8. I just checked the export.txt files from our last run, and the Q-scores range as high as 'i' (which, with +33 offset, would be 72!). Is there a flag we should be using to generate Phred+33?

Thanks,
Harold

P.S.-I just discovered that the bam files contain Phred+33 Q-scores, while the exports contain Phred+64. The good news: our existing pipeline takes export format as input, so I don't need to change anything. The bad news: I already changed it in anticipation of Phred+33...
Hi Harold,

you are correct. The change to +33 was made in FASTQ and BAM but not in export. The following section of my original post tried to address this point.

Thanks,
Semyon

The quality scores are transformed from integer to character so that a string can represent all of the quality scores within a read. In the CASAVA 1.8 release, we employ an ASCII offset of 33, which is the offset used in the Sanger FASTQ format. Illumina has moved away from an Illumina-specific offset, and adopted the Sanger transformation which is standard in the sequencing field For example, a Q30 base that was previously represented by the character ^ will now be represented by the character ?. The new transformation will be evident in the FASTQ file and the BAM file. The old transformation (ASCII offset of 64) will still be used in the export files, but export.txt is intended to be an internal file format.
skruglyak is offline   Reply With Quote
Old 07-19-2011, 04:46 AM   #65
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by skruglyak View Post
..., but export.txt is intended to be an internal file format.
Say that again with a straight face

A file called export is intended for Illumina's internal use only?!
maubp is offline   Reply With Quote
Old 07-19-2011, 06:16 AM   #66
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default

Quote:
Originally Posted by maubp View Post
Say that again with a straight face

A file called export is intended for Illumina's internal use only?!
Point well taken!

We need to deal with historical issues in some way. When we decided to adopt community standard file formats, the other file types were relegated to "internal use." It seemed that renaming them at that point would not have been helpful.

Semyon
skruglyak is offline   Reply With Quote
Old 07-19-2011, 07:45 AM   #67
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 509
Default

Hi Semyon,

I understand the legacy issues regarding export files; what I don't understand is why a SINGLE version of CASAVA is producing TWO different Q-score offsets. It's just one more detail to track (i.e., one more opportunity for mistakes to occur).

Harold
HESmith is offline   Reply With Quote
Old 07-19-2011, 04:47 PM   #68
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default

Quote:
Originally Posted by HESmith View Post
Hi Semyon,

I understand the legacy issues regarding export files; what I don't understand is why a SINGLE version of CASAVA is producing TWO different Q-score offsets. It's just one more detail to track (i.e., one more opportunity for mistakes to occur).

Harold
Thanks Harold. I appreciate that this is confusing and that the offsets should be the same in all files. It turned out that there were many technical implications to changing export files so we decided not to invest the resources given that we would encourage the use of BAM and FASTQ.

Semyon
skruglyak is offline   Reply With Quote
Old 08-11-2011, 01:38 PM   #69
hrbigelow
Junior Member
 
Location: Boston, MA

Join Date: May 2009
Posts: 4
Default

Quote:
Originally Posted by skruglyak View Post
Hi everyone,

my name is Semyon and I work in Bioinformatics at Illumina. Our team has prepared a document describing the major changes planned in CASAVA 1.8. The document is available at iCom and attached to this post. I will do my best to follow the thread and answer any questions that you may have. Early access of the release is planned for late February.

The key changes are:

1. The bcl converter will be distributed with CASAVA.
2. The converter will produce compressed FASTQ files rather than qseq files.
3. The FASTQ quality score encoding will use the standard offset value of 33 rather than the previous Illumina-specific offset value of 64.
4. If samples have been multiplexed in a sequencing run using indexing, the converter will also perform demultiplexing.
5. The output files will be in a directory structure organized by project and sample rather than lane and tile.
6. The GERALD summary file will be modified in accordance with the new directory structure.
7. The sequence output of post-alignment analysis will be a set of BAM files.

Thanks!
Hi Semyon,

Just read through the Casava 1.8 attached. I wondered a few things:

1. What is the naming convention for flow cells -- are they limited to [A-Z0-9] or [A-Za-z0-9], perhaps? Nothing is specified in the Wikipedia Fastq entry.

2. Are flow cell ids as purchased from Illumina meant to be globally unique, or is it possible for two flow cells or two runs on physically separate flow cells to somehow end up with the same flow cell ID?

3. Are xpos and ypos in pixels, and is it guaranteed that distinct reads coming from the same flow cell will have a distinct combination of lane,tile,xpos,ypos? Secondly, what is a reasonable conservative upper limit on the values of xpos and ypos?

4. In http://en.wikipedia.org/wiki/FASTQ_format, the pre-Casava 1.8 read id format has the 'unique instrument name' as the first field. But, I have seen fastq from Illumina which is not in Casava 1.8 format, but does seem to have a flow cell ID as its first field. I've also seen data that has 'instrument-name_flowcell-id' as the first field.

The data with only instrument name as the first identifier will collide with data from a different flowcell but run on the same machine. I am glad that the flowcell is now explicitly part of the Casava 1.8 spec.

5. I had thought there were 100 tiles (50 x 2) within each flowcell lane, but I have also encountered fastq IDs like this:

@HWI-ST630:1:1101:1209:2187#0/1

where the tile number is '1101'. Is there any specification on the maximum value of the 'tile' field?



Ultimately as you may have guessed, I'm interested in using this information to implement a perfect hashing of the read id string, so that reads may be efficiently sorted by ID without many millions of long string comparisons.

But more generally, I would greatly benefit from a well defined spec that guarantees global uniqueness of reads (in the world ) and also allows one to write such a hash function for them.

Thanks in advance!

Best,

Henry
hrbigelow is offline   Reply With Quote
Old 08-12-2011, 01:47 AM   #70
dawe
Senior Member
 
Location: 4530'25.22"N / 915'53.00"E

Join Date: Apr 2009
Posts: 258
Default

After some runs analyzed with CASAVA 1.8 I have the some considerations. I was a little skeptic about fastq in place of qseq, especially because the PF information was coded as a column (that I could easily filter with awk) while now is in the sequence ID. We've dropped any srf reference and decided to give fastq.gz a try. CASAVA official documents state I could filter QC-fails just like this:

Code:
for fastq in *.fastq.gz ; do zcat $fastq | grep
-A 4 '^@.* [^:]*:N:[^:]*:' > filtered_$fastq ; done
Unfortunately doesn't work... this does:

Code:
for fastq in *.fastq.gz ; do zgrep
-A 3 '^@.* [^:]*:N:[^:]*:' $fastq | grep -v -- '^--$' > filtered_$fastq ; done
This said, we've decide to align everything and skip the filtering. Once we have a SAM file, we use this simple awk command:

Code:
awk '{OFS="\t"; if(/:Y:/) $2=$2+512; print $0}'
to mark the QC failing reads just in the alignment file.
I should say we use bwa (and not ELAND) for alignments. Unfortunately bwa reads sequence ID in fastq as words and retains only the first one. This trims the QC info (because Y and N are just after a white space). This is a minor issue: we typically pipe fastq to bwa, now we just add a pipe module that translates spaces to underscores:

Code:
bwa aln GENOME <(zcat FILE.fastq.gz | sed -e "s/ /_/")
I was also skeptic about chunked fastq, especially when I had to deliver data. On the other side I found them very useful when running on a cluster: I can easily spawn multiple alignments to nodes without any additional preprocessing.
dawe is offline   Reply With Quote
Old 08-12-2011, 01:42 PM   #71
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by dawe View Post
After some runs analyzed with CASAVA 1.8 I have the some considerations. I was a little skeptic about fastq in place of qseq, especially because the PF information was coded as a column (that I could easily filter with awk) while now is in the sequence ID. We've dropped any srf reference and decided to give fastq.gz a try. CASAVA official documents state I could filter QC-fails just like this:

Code:
for fastq in *.fastq.gz ; do zcat $fastq | grep
-A 4 '^@.* [^:]*:N:[^:]*:' > filtered_$fastq ; done
Unfortunately doesn't work... this does:

Code:
for fastq in *.fastq.gz ; do zgrep
-A 3 '^@.* [^:]*:N:[^:]*:' $fastq | grep -v -- '^--$' > filtered_$fastq ; done
Yeah, that's typical of Illumina software. No one double-checks to see if stuff works right. On the other hand, bwa doesn't seem to mind the '--', though other programs might.

Quote:
I was also skeptic about chunked fastq, especially when I had to deliver data. On the other side I found them very useful when running on a cluster: I can easily spawn multiple alignments to nodes without any additional preprocessing.
You also know that bwa supports multi-threading, so instead of running multiple copies of bwa on each file, you can run bwa on one file using multiple processors. I didn't wait around to see if this actually worked properly but it didn't scream at me when I tried it:

Code:
bwa/0.5.9/bwa aln reference.fa *R1*.fastq.gz > f.out
Which saves a short step of catting everything together. But as you mentioned, the above command will shear off the passing filtering info on the read names.

[edit], Actually, I don't think that last bit will work. A colleague told me that it will only take the first fastq, not all of them. And in the sampe step, it wants a fastq file name there too, and I don't think it will take a list of them, so you do have to make one file in the end if you use bwa.

Last edited by swbarnes2; 08-15-2011 at 03:51 PM.
swbarnes2 is offline   Reply With Quote
Old 08-16-2011, 05:14 PM   #72
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default

Quote:
Originally Posted by hrbigelow View Post
Hi Semyon,

Just read through the Casava 1.8 attached. I wondered a few things:

1. What is the naming convention for flow cells -- are they limited to [A-Z0-9] or [A-Za-z0-9], perhaps? Nothing is specified in the Wikipedia Fastq entry.

A: They are limited to [A-Z0-9]

2. Are flow cell ids as purchased from Illumina meant to be globally unique, or is it possible for two flow cells or two runs on physically separate flow cells to somehow end up with the same flow cell ID?

A: Each flow cell has a unique ID.

3. Are xpos and ypos in pixels, and is it guaranteed that distinct reads coming from the same flow cell will have a distinct combination of lane,tile,xpos,ypos? Secondly, what is a reasonable conservative upper limit on the values of xpos and ypos?

A: Yes, the lane, tile, xpos, ypos are unique to a specific read, from the same flow cell. The upper limits will be dependent on the type of flow cell and instrument you are using GA vs HiSeq and v2 vs v3 flow cell. Do you need the upper bound for a particular instrument / flow cell?


4. In http://en.wikipedia.org/wiki/FASTQ_format, the pre-Casava 1.8 read id format has the 'unique instrument name' as the first field. But, I have seen fastq from Illumina which is not in Casava 1.8 format, but does seem to have a flow cell ID as its first field. I've also seen data that has 'instrument-name_flowcell-id' as the first field.

The data with only instrument name as the first identifier will collide with data from a different flowcell but run on the same machine. I am glad that the flowcell is now explicitly part of the Casava 1.8 spec.

A: That should not be a problem because the second field is the run number, which is incremented each run on an instrument.

5. I had thought there were 100 tiles (50 x 2) within each flowcell lane, but I have also encountered fastq IDs like this:

@HWI-ST630:1:1101:1209:2187#0/1

where the tile number is '1101'. Is there any specification on the maximum value of the 'tile' field?

A: The 100 tiles were for the old GAII flow cells. The current GAIIx flow cells have 120 tiles, numbered sequentially from 1-120. The layout and numbering of HiSeq/HiScan-SQ tiles is different. The tile number encodes certain information: Top/Bottom surface (1st digit = 1 or 2) + Swath (2nd digit 1,2 or 3) + Tile number (the 2 last digits from 01 to 08). Looking at your example the 1101 means the first tile of the first swath of the top surface.



Ultimately as you may have guessed, I'm interested in using this information to implement a perfect hashing of the read id string, so that reads may be efficiently sorted by ID without many millions of long string comparisons.

But more generally, I would greatly benefit from a well defined spec that guarantees global uniqueness of reads (in the world ) and also allows one to write such a hash function for them.

Thanks in advance!

Best,

Henry
Hi Henry,

I obtained some answers to your questions from my colleagues. Please see answers in text above.

thanks,
Semyon
skruglyak is offline   Reply With Quote
Old 11-03-2011, 05:48 AM   #73
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by maubp View Post
+1

My only concern is that the read names in the FASTQ files will not include the /1 or /2 suffix. This means both the forward and reverse reads get the same identifier, with the number (1 or 2) in the read description (i.e. in the @ line but after a white space). There are nice symmetries with the SAM/BAM format. However, this will mean any existing scripts/tools/pipelines expecting the suffices will need changing.
It is done now, but I'm not alone in thinking changing the /1 and /2 pair naming was a mistake:
http://www.freelists.org/post/mira_t...em-with-Mira,9

I encourage Illumina to move to producing their raw reads as unaligned SAM/BAM in future, where there are clear metadata structures for paired ends etc:
http://blastedbio.blogspot.com/2011/...ve-sambam.html
maubp is offline   Reply With Quote
Old 11-03-2011, 03:06 PM   #74
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default

Quote:
Originally Posted by maubp View Post
It is done now, but I'm not alone in thinking changing the /1 and /2 pair naming was a mistake:
http://www.freelists.org/post/mira_t...em-with-Mira,9

I encourage Illumina to move to producing their raw reads as unaligned SAM/BAM in future, where there are clear metadata structures for paired ends etc:
http://blastedbio.blogspot.com/2011/...ve-sambam.html
Yes, there were strong opinions on both sides of the read naming issue. At the time, unaligned BAM was not supported input to the popular aligners. The format has been getting wider acceptance and I see the value of providing it as an option in the future.
skruglyak is offline   Reply With Quote
Old 11-03-2011, 03:32 PM   #75
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by skruglyak View Post
Yes, there were strong opinions on both sides of the read naming issue. At the time, unaligned BAM was not supported input to the popular aligners. The format has been getting wider acceptance and I see the value of providing it as an option in the future.
Excellent - that's what I was hoping you would say
maubp is offline   Reply With Quote
Old 11-04-2011, 02:42 AM   #76
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Quote:
Originally Posted by skruglyak View Post
Yes, there were strong opinions on both sides of the read naming issue. At the time, unaligned BAM was not supported input to the popular aligners. The format has been getting wider acceptance and I see the value of providing it as an option in the future.
What is the "other side" of "both sides"?

We are running three HiSeqs and a few GAs; reading and rewriting a few hundred gigabytes of compressed sequence data just to fix a deficient header is quite annoying IMHO.

I do agree SAM would be a nice option for data storage (it should probably not replace fastq yet, many people do still use fastq as input for their programs).
If it very wise to use a binary (sequencing specific) storage format like BAM ... I don't know, just a bad feeling :-)

Strange enough (never mentioned) ... lots of IT folks would appreciate if the "we create many, many files" madness would be limited to some reasonable number.
1,629,325 files for a 2x120 run is by far too much ...

just my 2p,
Sven

Last edited by sklages; 11-04-2011 at 06:18 AM. Reason: typos
sklages is offline   Reply With Quote
Old 11-08-2011, 10:16 AM   #77
afaghalavi
Junior Member
 
Location: Tehran

Join Date: Sep 2011
Posts: 5
Default

Hello Dear Sir/Madam

We received our exome data and now i have 2 files (snps and indels) in text format.
I copy and paste a part of that in below. Please let me know what is next stage for data analysis and what shall I do ??!!! can I use annovar for its analysis and anotation??

#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0


Best
afaghalavi is offline   Reply With Quote
Old 11-09-2011, 05:45 AM   #78
Orr Shomroni
Member
 
Location: Netherlands

Join Date: Oct 2011
Posts: 26
Default

Thanks for the tip on the filtering, dawe. Our previous filtering resulted with only headers for 'Y' reads and -- as body, and apperently that wasn't much of an issue. Still, the new command makes it look cleaner.

One thing troubles me, though. I am trying to run the filtered files on FastQC, but I'm getting an error that the filtered fastq files are not in gz format. When I try to compress them, it says it cannot, because they are already in .gz format; when I try to decompress them, I get an error because the files are not GZIP files.

I imagine there should be an easy way to modify the extension for the filtered fastq file, but I am not sure how to do that within the "for" loop
__________________
"Though it may seem that all's been said and done, originality still lives on" - some unoriginal guy who had nothing better to write as his signature
Orr Shomroni is offline   Reply With Quote
Old 11-09-2011, 07:42 AM   #79
Orr Shomroni
Member
 
Location: Netherlands

Join Date: Oct 2011
Posts: 26
Default

Ok, I solved the problem. Maybe I missed it, but this situation only applies if you are dealing with uncompressed fastq files to begin with. The filtering process necessarily returns an unzipped file, so the filename has to be adjusted and the file has to be compressed
__________________
"Though it may seem that all's been said and done, originality still lives on" - some unoriginal guy who had nothing better to write as his signature
Orr Shomroni is offline   Reply With Quote
Old 12-13-2011, 11:12 AM   #80
olus
Member
 
Location: milan, italy

Join Date: Aug 2008
Posts: 22
Default

Quote:
Originally Posted by sparks View Post
Hi,
V1.8 has some extra fields:
<is filtered> is Y if the read is filtered, N otherwise.
<control number> is 0 when none of the control bits are on, otherwise it is an even number.
Does anyone know what these are for?
Is is_filtered reminiscent of QSEQ quality flag and if so does 'Y' mean high or low quality?

Colin
Hi Colin.
Did find out what
<control number>
in '@' FASTQ line is used for?

Except the light definition in the official pdf I couldn't find any suggestion.

If anybody could give me some hints it would be really appreciated!

Gabriele
__________________
gabriele bucci
olus is offline   Reply With Quote
Reply

Tags
casava, illumina, secondary analysis

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:02 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO