#61
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
Quote:
What is the new numeric range for scores for the v3 chemistry? Should this be considered Phred+45 mapping now? Thanks in advance.
#62
Member
Location: San Diego Join Date: Sep 2010
Posts: 44
Quote:
Hi, if you are asking about the offset, we are Phred+33. We recently made the change from +64 to +33 to match the original Sanger convention and we will not move away from this. Our latest quality calibration data showed an upper bound of 41. We hope to push this higher in the future.
Thanks,
Semyon
#63
Senior Member
Location: Bethesda MD Join Date: Oct 2009
Posts: 509
Hi Semyon,
It doesn't appear that the offsets have changed, at least in our copy of CASAVA v1.8. I just checked the export.txt files from our last run, and the Q-scores range as high as 'i' (which, with a +33 offset, would be 72!). Is there a flag we should be using to generate Phred+33?
Thanks,
Harold
P.S. I just discovered that the BAM files contain Phred+33 Q-scores, while the exports contain Phred+64. The good news: our existing pipeline takes export format as input, so I don't need to change anything. The bad news: I already changed it in anticipation of Phred+33...
Last edited by HESmith; 07-18-2011 at 08:34 AM. Reason: new info
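For anyone wanting to check which offset a file uses, a rough heuristic along these lines may help. It is only a sketch: the function name `guess_phred_offset` is made up for illustration, it expects one quality string per line on standard input, and it assumes Phred+33 scores stay at or below the upper bound of 41 mentioned above (so no character above 'J').

```shell
# Guess the ASCII offset from the lowest and highest quality characters seen.
# Input: one quality string per line. Output: phred33, phred64, ambiguous or unknown.
guess_phred_offset() {
  awk 'BEGIN {
         # Table of printable characters to their ASCII codes.
         for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
         min = 127; max = 0
       }
       {
         for (j = 1; j <= length($0); j++) {
           c = substr($0, j, 1)
           if (!(c in ord)) continue
           v = ord[c]
           if (v < min) min = v
           if (v > max) max = v
         }
       }
       END {
         if (max == 0)      print "unknown"
         else if (min < 59) print "phred33"    # below ";" cannot occur with a +64 offset
         else if (max > 74) print "phred64"    # above "J" would exceed Q41 with a +33 offset
         else               print "ambiguous"
       }'
}
```

For a FASTQ file, something like `awk 'NR % 4 == 0' reads.fastq | guess_phred_offset` would feed it the quality lines (`reads.fastq` is a placeholder name).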
#64
Member
Location: San Diego Join Date: Sep 2010
Posts: 44
Quote:
You are correct. The change to +33 was made in FASTQ and BAM but not in export. The following section of my original post tried to address this point.
Thanks,
Semyon
The quality scores are transformed from integer to character so that a string can represent all of the quality scores within a read. In the CASAVA 1.8 release, we employ an ASCII offset of 33, which is the offset used in the Sanger FASTQ format. Illumina has moved away from an Illumina-specific offset, and adopted the Sanger transformation which is standard in the sequencing field. For example, a Q30 base that was previously represented by the character “^” will now be represented by the character “?”. The new transformation will be evident in the FASTQ file and the BAM file. The old transformation (ASCII offset of 64) will still be used in the export files, but export.txt is intended to be an internal file format.
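The arithmetic behind the “^” to “?” example is a shift of 31 code points (64 minus 33). A minimal sketch of that conversion for a four-line-per-record FASTQ stream could look like this; `phred64_to_33` is a hypothetical helper name, not part of CASAVA, and it assumes the input really is Phred+64.

```shell
# Rewrite the quality line (every 4th line) of a FASTQ stream from offset 64 to offset 33.
phred64_to_33() {
  awk 'BEGIN {
         # Lookup table: each Phred+64 character maps to the character 31 code points lower.
         for (i = 64; i <= 126; i++) map[sprintf("%c", i)] = sprintf("%c", i - 31)
       }
       NR % 4 == 0 {
         out = ""
         for (j = 1; j <= length($0); j++) {
           c = substr($0, j, 1)
           # Characters below "@" are left unchanged; they suggest the input was not Phred+64.
           out = out ((c in map) ? map[c] : c)
         }
         print out
         next
       }
       { print }'
}
```

Headers, sequences and the "+" line pass through untouched; only the quality line is remapped.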
#65
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
#66
Member
Location: San Diego Join Date: Sep 2010
Posts: 44
Quote:
We need to deal with historical issues in some way. When we decided to adopt community standard file formats, the other file types were relegated to "internal use." It seemed that renaming them at that point would not have been helpful.
Semyon
#67
Senior Member
Location: Bethesda MD Join Date: Oct 2009
Posts: 509
Hi Semyon,
I understand the legacy issues regarding export files; what I don't understand is why a SINGLE version of CASAVA is producing TWO different Q-score offsets. It's just one more detail to track (i.e., one more opportunity for mistakes to occur).
Harold
#68
Member
Location: San Diego Join Date: Sep 2010
Posts: 44
Quote:
Semyon
#69
Junior Member
Location: Boston, MA Join Date: May 2009
Posts: 4
Quote:
Just read through the attached Casava 1.8 document. I wondered a few things:
1. What is the naming convention for flow cells? Are they limited to [A-Z0-9] or [A-Za-z0-9], perhaps? Nothing is specified in the Wikipedia FASTQ entry.
2. Are flow cell IDs as purchased from Illumina meant to be globally unique, or is it possible for two flow cells, or two runs on physically separate flow cells, to somehow end up with the same flow cell ID?
3. Are xpos and ypos in pixels, and is it guaranteed that distinct reads coming from the same flow cell will have a distinct combination of lane, tile, xpos, ypos? Secondly, what is a reasonable conservative upper limit on the values of xpos and ypos?
4. In http://en.wikipedia.org/wiki/FASTQ_format, the pre-Casava 1.8 read ID format has the 'unique instrument name' as the first field. But I have seen FASTQ from Illumina which is not in Casava 1.8 format, yet does seem to have a flow cell ID as its first field. I've also seen data that has 'instrument-name_flowcell-id' as the first field. The data with only the instrument name as the first identifier will collide with data from a different flow cell run on the same machine. I am glad that the flow cell is now explicitly part of the Casava 1.8 spec.
5. I had thought there were 100 tiles (50 x 2) within each flow cell lane, but I have also encountered FASTQ IDs like this: @HWI-ST630:1:1101:1209:2187#0/1 where the tile number is '1101'. Is there any specification on the maximum value of the 'tile' field?
Ultimately, as you may have guessed, I'm interested in using this information to implement a perfect hashing of the read ID string, so that reads may be efficiently sorted by ID without many millions of long string comparisons. But more generally, I would greatly benefit from a well-defined spec that guarantees global uniqueness of reads (in the world).
Thanks in advance!
Best,
Henry
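As an illustration of the two ID layouts discussed in the questions above, here is a small sketch that splits a header line into numeric fields that can then be sorted without long string comparisons. The function name `parse_read_id` is invented for this example. It assumes the CASAVA 1.8 layout described in the Wikipedia FASTQ entry (`instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index`) and the older `instrument:lane:tile:x:y#index/read` layout; it does not answer the uniqueness questions.

```shell
# Print tab-separated: prefix, lane, tile, x, y, read, is_filtered
# prefix is the flow cell ID for CASAVA 1.8 headers and the instrument name for older ones.
parse_read_id() {
  awk '{
    sub(/^@/, "")
    if (index($0, " ") > 0) {
      # CASAVA 1.8 style: two space-separated groups of colon-separated fields
      split($1, a, ":")
      split($2, b, ":")
      printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n", a[3], a[4], a[5], a[6], a[7], b[1], b[2]
    } else {
      # Older style: the filter status is not part of the ID, so report NA
      split($0, a, "[:#/]")
      printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n", a[1], a[2], a[3], a[4], a[5], a[7], "NA"
    }
  }'
}
```

The output could then be ordered numerically with, for example, `sort -k2,2n -k3,3n -k4,4n -k5,5n`.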
#70
Senior Member
Location: 45°30'25.22"N / 9°15'53.00"E Join Date: Apr 2009
Posts: 258
After some runs analyzed with CASAVA 1.8 I have some considerations. I was a little skeptical about fastq in place of qseq, especially because the PF information was coded as a column (that I could easily filter with awk), while now it is in the sequence ID. We've dropped any SRF reference and decided to give fastq.gz a try. CASAVA official documents state I could filter QC-fails just like this:
Code:
for fastq in *.fastq.gz ; do zcat $fastq | grep -A 4 '^@.* [^:]*:N:[^:]*:' > filtered_$fastq ; done
Code:
for fastq in *.fastq.gz ; do zgrep -A 3 '^@.* [^:]*:N:[^:]*:' $fastq | grep -v -- '^--$' > filtered_$fastq ; done
Code:
awk '{OFS="\t"; if(/:Y:/) $2=$2+512; print $0}'
I should say we use bwa (and not ELAND) for alignments. Unfortunately bwa reads the sequence ID in fastq as words and retains only the first one. This trims the QC info (because Y and N come just after a white space). This is a minor issue: we typically pipe fastq to bwa, so now we just add a pipe module that translates spaces to underscores:
Code:
bwa aln GENOME <(zcat FILE.fastq.gz | sed -e "s/ /_/")
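The value 512 in the awk one-liner above corresponds to SAM flag 0x200 (read fails platform/vendor quality checks). A slightly more defensive sketch of the same idea is shown below. The name `flag_qc_fail` is hypothetical, and the pattern assumes read names carry the CASAVA 1.8 suffix joined by an underscore (as produced by the sed step above), for example `..._1:Y:0:ATCACG`.

```shell
# Set SAM flag 0x200 (decimal 512) on reads whose name marks them as filtered (":Y:").
flag_qc_fail() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@/ { print; next }     # SAM header lines pass through unchanged
       $1 ~ /_[0-9]+:Y:/ && int($2 / 512) % 2 == 0 { $2 += 512 }   # only if the bit is not set yet
       { print }'
}
```

Compared with matching `:Y:` anywhere on the line, this looks only at the read name and avoids adding 512 twice if the stream is processed again.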
#71
Senior Member
Location: San Diego Join Date: May 2008
Posts: 912
Quote:
Quote:
Code:
bwa/0.5.9/bwa aln reference.fa *R1*.fastq.gz > f.out
[edit] Actually, I don't think that last bit will work. A colleague told me that it will only take the first fastq, not all of them. And in the sampe step, it wants a fastq file name there too, and I don't think it will take a list of them, so you do have to make one file in the end if you use bwa.
Last edited by swbarnes2; 08-15-2011 at 03:51 PM.
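One way to "make one file in the end" is to decompress the per-chunk files in order and recompress them as a single stream. This is just a sketch; `merge_fastq_gz` and the file names in the usage line are placeholders, not anything CASAVA or bwa provides.

```shell
# Merge several gzip-compressed FASTQ files into one, in the order given.
# Usage: merge_fastq_gz OUTPUT.fastq.gz INPUT1.fastq.gz [INPUT2.fastq.gz ...]
merge_fastq_gz() {
  out=$1
  shift
  # Decompress all inputs in sequence and recompress as a single gzip stream.
  gzip -dc "$@" | gzip > "$out"
}
```

For example, `merge_fastq_gz merged_read1.fastq.gz *R1*.fastq.gz` followed by `bwa aln reference.fa merged_read1.fastq.gz > f.out`. The output name is chosen so that it does not match the `*R1*` glob on a second run.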
#72
Member
Location: San Diego Join Date: Sep 2010
Posts: 44
Quote:
I obtained some answers to your questions from my colleagues. Please see answers in text above.
Thanks,
Semyon
#73
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
Quote:
http://www.freelists.org/post/mira_t...em-with-Mira,9
I encourage Illumina to move to producing their raw reads as unaligned SAM/BAM in future, where there are clear metadata structures for paired ends etc.:
http://blastedbio.blogspot.com/2011/...ve-sambam.html
#74
Member
Location: San Diego Join Date: Sep 2010
Posts: 44
Quote:
#75
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
Quote:
#76
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
Quote:
We are running three HiSeqs and a few GAs; reading and rewriting a few hundred gigabytes of compressed sequence data just to fix a deficient header is quite annoying IMHO. I do agree SAM would be a nice option for data storage (it should probably not replace fastq yet; many people still use fastq as input for their programs). Whether it is very wise to use a binary (sequencing-specific) storage format like BAM ... I don't know, just a bad feeling :-)
Strangely enough (never mentioned) ... lots of IT folks would appreciate it if the "we create many, many files" madness were limited to some reasonable number. 1,629,325 files for a 2x120 run is by far too many ...
Just my 2p,
Sven
Last edited by sklages; 11-04-2011 at 06:18 AM. Reason: typos
#77
Junior Member
Location: Tehran Join Date: Sep 2011
Posts: 5
Hello Dear Sir/Madam,
We received our exome data and now I have 2 files (SNPs and indels) in text format. I have copied and pasted a part of one below. Please let me know what the next stage of data analysis is and what I should do. Can I use ANNOVAR for its analysis and annotation?
#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0
Best
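For readers unfamiliar with this layout, the sixth whitespace-separated column in the excerpt above is Q(snp). A small sketch of filtering on that column is shown here; `filter_snps_by_q` and the threshold in the usage line are made up for illustration and are not a recommended cutoff.

```shell
# Keep comment lines and any site whose Q(snp) value (column 6) is at least the given minimum.
# Usage: filter_snps_by_q MIN_Q < snps.txt
filter_snps_by_q() {
  awk -v min_q="$1" '
    /^#/ { print; next }              # keep lines such as "#$ COLUMNS ..."
    $6 + 0 >= min_q + 0 { print }     # compare Q(snp) numerically
  '
}
```

With the four rows shown, `filter_snps_by_q 20` would keep the sites at positions 12783 and 14673.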
#78
Member
Location: Netherlands Join Date: Oct 2011
Posts: 26
Thanks for the tip on the filtering, dawe. Our previous filtering resulted in only headers for 'Y' reads with -- as the body, and apparently that wasn't much of an issue. Still, the new command makes it look cleaner.
One thing troubles me, though. I am trying to run FastQC on the filtered files, but I'm getting an error that the filtered fastq files are not in gz format. When I try to compress them, it says it cannot, because they are already in .gz format; when I try to decompress them, I get an error because the files are not GZIP files. I imagine there should be an easy way to modify the extension for the filtered fastq file, but I am not sure how to do that within the "for" loop.
__________________
"Though it may seem that all's been said and done, originality still lives on" - some unoriginal guy who had nothing better to write as his signature
#79
Member
Location: Netherlands Join Date: Oct 2011
Posts: 26
Ok, I solved the problem. Maybe I missed it, but this situation only applies if you are dealing with uncompressed fastq files to begin with. The filtering process necessarily returns an unzipped file, so the filename has to be adjusted and the file has to be compressed.
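A sketch of doing the filtering and the compression in one step, so that the `.gz` name matches the content, might look like the following. Both function names are invented for this example, and the filter assumes four-line records with CASAVA 1.8 headers, where the second word starts with `read:N:` for reads that passed the filter.

```shell
# Keep only records whose header marks the read as not filtered (is_filtered = N).
keep_pf_reads() {
  awk 'NR % 4 == 1 { keep = ($2 ~ /^[0-9]+:N:/) }
       keep { print }'
}

# Usage: filter_fastq_gz INPUT.fastq.gz [OUTPUT_DIR]
# Writes OUTPUT_DIR/filtered_INPUT.fastq.gz as a real gzip file.
filter_fastq_gz() {
  src=$1
  outdir=${2:-.}
  gzip -dc "$src" | keep_pf_reads | gzip > "$outdir/filtered_$(basename "$src")"
}
```

Inside a loop this would be `for fastq in *.fastq.gz ; do filter_fastq_gz "$fastq" ; done`. Working on whole records by line number also avoids the `--` separator lines that grep's context option inserts.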
#80
Member
Location: milan, italy Join Date: Aug 2008
Posts: 22
Quote:
Did you find out what <control number> in the '@' FASTQ line is used for? Apart from the brief definition in the official PDF, I couldn't find any explanation. If anybody could give me some hints it would be really appreciated!
Gabriele
__________________
gabriele bucci
Tags: casava, illumina, secondary analysis