SEQanswers

Old 09-23-2011, 01:32 PM   #1
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default Change in CASAVA / BCL->FASTQ

We are planning a minor release of CASAVA in October that is primarily intended to increase the number of supported index sequences. In the same release, we plan to change the default behavior and omit reads that do not pass filter (non-PF reads) from the FASTQ files. In general, we do not recommend the use of non-PF reads. Users who want to retain the non-PF reads will be able to do so by adding the following parameter to configureBclToFastq.pl:

--with-failed-reads

A read is classified as non-PF when more than one cycle in the first 25 cycles has a poor ratio (<0.6) of the brightest intensity to the sum of the brightest and second brightest.
Our variant calling software ignores non-PF reads, but many alternative methods use all data and disregard the non-PF flag. Including non-PF reads increases alignment time, the data footprint, and the measured error rate, and can lead to variant calling errors. As a result, we have decided to exclude such reads by default. Because they are excluded from the FASTQ files, these reads will also be excluded from all downstream processing and output, including both archival and standard BAM files.
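
The filter rule described above can be sketched as follows. This is only an illustration, assuming the four channel intensities for each cycle are available as plain numbers; the function names and input layout are hypothetical and are not part of CASAVA or RTA.

```python
def chastity(intensities):
    """Brightest intensity divided by the sum of the brightest and second brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    # Guard against a zero or negative denominator (assumption: treat as failing).
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """Return True if the read is PF under the rule quoted in the post.

    cycles: sequence of 4-tuples of channel intensities, one tuple per cycle.
    A read is non-PF when MORE than one of the first 25 cycles has a ratio < 0.6.
    """
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


good = (100, 10, 5, 5)   # ratio 100 / 110, well above 0.6
poor = (50, 45, 5, 5)    # ratio 50 / 95, below 0.6
print(passes_filter([poor] + [good] * 24))        # one poor cycle: still PF
print(passes_filter([poor, poor] + [good] * 23))  # two poor cycles: non-PF
```

Cycles after the first 25 do not affect the outcome in this sketch, which matches the wording of the rule above.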

Please let me know if you have questions or concerns.

Thank you,
Semyon
Old 09-23-2011, 01:40 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,155

Thank you.
Old 09-24-2011, 12:56 PM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,814

This saves us from having to add steps post standard CASAVA processing. Thanks.
Old 09-26-2011, 05:23 AM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870

Thanks for changing this - this will help us a lot.

If there's any chance you could add a switch to ELAND so that, when it is run as part of a pipeline, it writes out a single BAM file containing just the alignments (the equivalent of the old sorted files), then we'd have back in 1.8 all of the functionality we had in previous versions, with the benefit of smaller, more standard output files.
Old 10-01-2011, 11:44 AM   #5
msincan
Member
 
Location: Maryland

Join Date: Dec 2009
Posts: 19

What are the alternate methods that use all data, disregarding the non-PF flag?
Old 10-03-2011, 02:25 PM   #6
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44

Quote:
Originally Posted by msincan View Post
What are the alternate methods that use all data, disregarding the non-PF flag?
A common aligner such as BWA will not recognize the filter flag in our FASTQ file. As a result, the BAM bitwise flag that reflects "not passing quality controls" will not be set. Any variant caller (samtools or GATK) will end up using all of the data.
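
For anyone who keeps non-PF reads in the FASTQ and aligns with a tool that ignores the filter flag, the reads can be dropped before alignment. The sketch below assumes the CASAVA 1.8 header layout, in which the second colon-separated field after the space is Y for a filtered (non-PF) read and N otherwise, and it assumes plain 4-line FASTQ records. The function names and the example read names are made up for illustration.

```python
import io


def is_filtered(header):
    """Return True if a CASAVA 1.8-style header marks the read as filtered (non-PF)."""
    # Assumed layout: @<instrument>:...:<y-pos> <read>:<is filtered>:<control>:<index>
    parts = header.rstrip("\n").split(" ", 1)
    if len(parts) != 2:
        raise ValueError("header has no comment field: %r" % header)
    fields = parts[1].split(":")
    if len(fields) < 2 or fields[1] not in ("Y", "N"):
        raise ValueError("no Y/N filter flag in header: %r" % header)
    return fields[1] == "Y"


def pf_records(handle):
    """Yield 4-line FASTQ records (as lists of lines) that are not flagged as filtered."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of input
        if not is_filtered(record[0]):
            yield record


example = io.StringIO(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG\nACGT\n+\nIIII\n"
    "@EAS139:136:FC706VJ:2:2104:15344:197394 1:N:18:ATCACG\nTTTT\n+\nIIII\n"
)
for rec in pf_records(example):
    print("".join(rec), end="")
```

For paired-end data, both files of a pair would need to be filtered in step so that mates stay synchronized; that is not handled here.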

Thanks,

Semyon
Old 10-17-2011, 09:02 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

Hi Semyon

Could you clarify if the unaligned reads in SAM/BAM format from Illumina CASAVA 1.8.x have FLAG bit 0x200 (not passing quality controls) set according to your non-PF QC?

Thanks,

Peter
Old 10-17-2011, 09:39 AM   #8
mcrusch
Junior Member
 
Location: Tennessee

Join Date: Mar 2011
Posts: 6

Thank you for this change!

Is this patch released yet, or is there a more specific ETA?
Old 10-17-2011, 10:02 AM   #9
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44

Quote:
Originally Posted by maubp View Post
Hi Semyon

Could you clarify if the unaligned reads in SAM/BAM format from Illumina CASAVA 1.8.x have FLAG bit 0x200 (not passing quality controls) set according to your non-PF QC?

Thanks,

Peter
Hi Peter,

There is a distinction between unaligned reads and non-PF reads.
In CASAVA 1.8 all non-PF reads in the BAM output have the "not passing quality controls" flag bit set (0x200). Note that this setting is independent of alignment -- unaligned reads are indicated with the conventional "segment unmapped" flag bit (0x004). Starting in 1.8.2, the default behavior will be to exclude non-PF reads entirely, as explained earlier in this thread.
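
To make the distinction between the two bits concrete, here is a small sketch that decodes them from a SAM FLAG value and drops QC-failed records from SAM text. The bit values come from the SAM specification; the function names are illustrative only.

```python
QC_FAIL = 0x200   # "not passing quality controls"
UNMAPPED = 0x4    # "segment unmapped"


def describe_flag(flag):
    """Report the two independent bits discussed above for a SAM FLAG value."""
    return {"unmapped": bool(flag & UNMAPPED), "qc_fail": bool(flag & QC_FAIL)}


def drop_qc_fail(sam_lines):
    """Yield header lines and alignment lines whose FLAG lacks the 0x200 bit."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not flag & QC_FAIL:
            yield line


print(describe_flag(0x200))  # aligned read that failed the filter
print(describe_flag(0x4))    # unaligned read that passed the filter
```

On BAM files, `samtools view -F 0x200` applies the same exclusion.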

Thanks,
Semyon
Old 10-17-2011, 10:10 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

Thanks.

Apologies if I was unclear - I was trying to distinguish raw reads in SAM/BAM (all reads unaligned) from a finished assembly/mapping in SAM/BAM (where most of the reads are aligned).
Old 10-17-2011, 10:31 AM   #11
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44

Quote:
Originally Posted by maubp View Post
Thanks.

Apologies if I was unclear - I was trying to distinguish raw reads in SAM/BAM (all reads unaligned) from a finished assembly/mapping in SAM/BAM (where most of the reads are aligned).

Sorry if I still misunderstand... CASAVA produces FASTQ files (not unaligned BAMs). The only BAM files produced are post alignment.
Old 10-17-2011, 10:42 AM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

Oh. I was under the (wrong?) impression that Illumina was looking at producing unaligned SAM/BAM as an output alternative. This idea is attractive because it has explicit standards for things like QC flags, and other things like read pairings - rather than the current pain where the precise encoding of this meta information into the FASTQ free text seems to change far too often.

Perhaps I'd misheard the news that Illumina was doing post-alignment output as SAM/BAM.

Update: See this blog post and this thread for more about unaligned SAM/BAM as an alternative to FASTQ.
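
As a rough illustration of what unaligned SAM offers over FASTQ free text, the sketch below turns one CASAVA 1.8-style FASTQ record into an unaligned SAM line, encoding the pairing and the filter status in the standard FLAG bits. The header layout (read number and Y/N filter flag after the space) is assumed, the function name and example read name are made up, and this is not something CASAVA itself produces.

```python
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
FIRST, SECOND, QC_FAIL = 0x40, 0x80, 0x200


def fastq_to_unaligned_sam(header, seq, qual, paired=True):
    """Build one unaligned SAM record from a CASAVA 1.8-style FASTQ record."""
    # Split "@name read:filter:control:index" into the name and the comment.
    name, _, comment = header.strip()[1:].partition(" ")
    read_number, is_filtered = comment.split(":")[:2]
    flag = UNMAPPED
    if paired:
        flag |= PAIRED | MATE_UNMAPPED
        flag |= FIRST if read_number == "1" else SECOND
    if is_filtered == "Y":
        flag |= QC_FAIL  # non-PF read
    # Unaligned records use '*' / 0 placeholders for all alignment columns.
    return "\t".join([name, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual])


print(fastq_to_unaligned_sam(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG", "ACGT", "IIII"))
```

The quality string is copied unchanged, which assumes the Phred+33 encoding shared by CASAVA 1.8 FASTQ and SAM.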

Last edited by maubp; 10-21-2011 at 05:51 AM.
Old 10-19-2011, 02:54 PM   #13
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44

Quote:
Originally Posted by mcrusch View Post
Thank you for this change!

Is this patch released yet, or is there a more specific ETA?

1.8.2 is available starting today.

Thanks,

Semyon
Old 10-20-2011, 07:27 AM   #14
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,155

Quote:
Originally Posted by skruglyak View Post
1.8.2 is available starting today.

Thanks,

Semyon
Semyon,

Thanks for letting us know.

I do have one nit to pick, however. Starting with CASAVA 1.8, the download tarball contains a massive validation data set: 1.5 GB (>90% of the uncompressed size). Would it be possible to separate the code from the sample data for the folks who don't want to spend a couple of hours downloading the software? It's particularly irksome since I've now downloaded exactly the same data set 3 times (with 1.8.0, 1.8.1 and now 1.8.2).

Thanks.
Old 10-20-2011, 07:43 AM   #15
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870

Quote:
Originally Posted by kmcarr View Post
Would it be possible to separate the code from the sample data
Yes - if you could change this it would be great! Very few people will bother running the validation data and those that want to will be happy to download it. We got awful transfer rates from illumina.com (presumably pulling data over a transatlantic link), and downloading the last Casava update took the best part of a day.
Old 10-24-2011, 06:39 AM   #16
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,155

Quote:
Originally Posted by skruglyak View Post
We are planning a minor release of CASAVA in October that is primarily intended to increase the number of supported index sequences. In the same release, we plan to change the default behavior and omit reads that do not pass filter (non-PF reads) from the FASTQ files. In general, we do not recommend the use of non-PF reads. Users who want to retain the non-PF reads will be able to do so by adding the following parameter to configureBclToFastq.pl:

--with-failed-reads

A read is classified as non-PF when more than one cycle in the first 25 cycles has a poor ratio (<0.6) of the brightest intensity to the sum of the brightest and second brightest.
Our variant calling software ignores non-PF reads, but many alternative methods use all data and disregard the non-PF flag. Including non-PF reads increases alignment time, the data footprint, and the measured error rate, and can lead to variant calling errors. As a result, we have decided to exclude such reads by default. Because they are excluded from the FASTQ files, these reads will also be excluded from all downstream processing and output, including both archival and standard BAM files.

Please let me know if you have questions or concerns.

Thank you,
Semyon
Semyon,

I really appreciate that Illumina has been so responsive to customer feedback with regard to refinement of the CASAVA pipeline and I really hate to keep coming up with more things to tweak/change, but...

I just ran my first data set through the new 1.8.2 pipeline and truly appreciate the PF-only default and the --fastq-cluster-count 0 option. However, I noted what I consider a bug in some of the summary files produced by CASAVA. Some summary files (e.g. Flowcell_demux_summary.xml) report the number of PF clusters/bases for both the raw and PF counts. Other files (e.g. BustardSummary.xml) appear to correctly report raw and PF.

Thanks again.
Old 11-14-2011, 08:44 AM   #17
selen
Junior Member
 
Location: Ohio

Join Date: Dec 2010
Posts: 9
A single bam file as alignment output

Dear Semyon,

Is there a way to get alignments in a single file per sample in bam format as alignment output?

As far as I know, we currently need an additional "configureBuild.pl --targets sort bam" step to achieve this.

Thanks
Old 11-14-2011, 10:41 AM   #18
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44

Quote:
Originally Posted by kmcarr View Post
Semyon,

I really appreciate that Illumina has been so responsive to customer feedback with regard to refinement of the CASAVA pipeline and I really hate to keep coming up with more things to tweak/change, but...

I just ran my first data set through the new 1.8.2 pipeline and truly appreciate the PF-only default and the --fastq-cluster-count 0 option. However, I noted what I consider a bug in some of the summary files produced by CASAVA. Some summary files (e.g. Flowcell_demux_summary.xml) report the number of PF clusters/bases for both the raw and PF counts. Other files (e.g. BustardSummary.xml) appear to correctly report raw and PF.

Thanks again.
Sorry for the late reply. I somehow missed notification of the post. You are correct. The stats are computed after the FASTQ file is made, so the raw counts only see the reads that were written, which leads to the issue that you observe. Have you tried using SAV (Sequencing Analysis Viewer)? It reports a lot of valuable statistics created by RTA, including %PF.

Thanks for your feedback.

Semyon
Old 11-14-2011, 10:45 AM   #19
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44

Quote:
Originally Posted by selen View Post
Dear Semyon,

Is there a way to get alignments in a single file per sample in bam format as alignment output?

As far as I know, we currently need an additional "configureBuild.pl --targets sort bam" step to achieve this.

Thanks
Hi selen,

You are correct to use configureBuild to generate the single BAM file. I spoke with a member of my team and he provided the following example.

Thanks,
Semyon

$CASAVA_PATH/bin/configureBuild.pl \
--outDir ./outdir \
--inSampleDir /path/to/eland_alignment/Sample_exampleSample \
--samtoolsRefFile genome.fa \
--targets sort bam \
--sortKeepAllReads
Old 11-15-2011, 11:06 AM   #20
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,155

Quote:
Originally Posted by skruglyak View Post
Sorry for the late reply. I somehow missed notification of the post. You are correct. The stats are computed after the FASTQ file is made, so the raw counts only see the reads that were written, which leads to the issue that you observe. Have you tried using SAV (Sequencing Analysis Viewer)? It reports a lot of valuable statistics created by RTA, including %PF.

Thanks for your feedback.

Semyon
Semyon,

Yes, it's true SAV presents some of that data, but I need the data in a format that I can parse to generate reports. This means the .xml files produced by CASAVA. The files produced by CASAVA really should properly report the number of Raw and PF clusters generated regardless of what is output to the FASTQ files.

Tags
casava, fastq, filtered reads

All times are GMT -8.

