SEQanswers

skruglyak 09-23-2011 01:32 PM

Default Change in CASAVA / BCL->FASTQ
 
We are planning a minor release of CASAVA in October that is primarily intended to handle an improvement to the number of supported index sequences. In the same release, we plan to change the default behavior and omit reads that do not pass filter (non-PF reads) from the FASTQ files. In general, we do not recommend the use of non-PF reads. Users who want to retain the non-PF reads will be able to do so by adding the following parameter to configureBclToFastq.pl:

--with-failed-reads

A read is classified as non-PF when more than one cycle in the first 25 cycles has a poor ratio (<0.6) of the brightest intensity to the sum of the brightest and second brightest.
Our variant calling software ignores non-PF reads, but there are many alternate methods that use all data, disregarding the non-PF flag. The inclusion of non-PF reads increases time to align, increases the data footprint, increases the measured error rate, and can lead to variant calling errors. As a result, we have decided to exclude such reads by default. Because they are excluded from the FASTQ files, the reads will also be excluded from all downstream processing and output, including both archival and standard BAM files.

Please let me know if you have questions or concerns.

Thank you,
Semyon
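
The filter rule described in the post above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (a read is non-PF when more than one of the first 25 cycles has a ratio below 0.6), not Illumina's implementation. The function names and the representation of each cycle as a tuple of four channel intensities are assumptions made for the example.

```python
def chastity(intensities):
    """Ratio of the brightest intensity to the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    # Simplification: a cycle with no usable signal is treated as failing.
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """Return True when at most `max_failures` of the first `window`
    cycles have a chastity ratio below `threshold`."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


# Hypothetical intensities: one clearly dominant channel per cycle.
clean_cycle = (100.0, 10.0, 5.0, 5.0)
mixed_cycle = (50.0, 50.0, 0.0, 0.0)   # ratio 0.5, below the threshold

read_ok = [clean_cycle] * 30
read_one_bad = [mixed_cycle] + [clean_cycle] * 29
read_two_bad = [mixed_cycle] * 2 + [clean_cycle] * 28
```

With these inputs, `passes_filter(read_ok)` and `passes_filter(read_one_bad)` are true, while `passes_filter(read_two_bad)` is false because two early cycles fall below the threshold. Real intensity data can be noisier than this (for instance negative values after background correction), which the sketch does not handle.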

kmcarr 09-23-2011 01:40 PM

Thank you.

GenoMax 09-24-2011 12:56 PM

This saves us from having to add steps post standard CASAVA processing. Thanks.

simonandrews 09-26-2011 05:23 AM

Thanks for changing this - this will help us a lot.

If there's any chance you could add a switch to eland so that, when it is run as part of a pipeline, it writes out a single BAM file containing just the alignments (the equivalent of the old sorted files), then we'd have back in 1.8 all of the functionality we had in previous versions, with the benefit of smaller, more standard output files.

msincan 10-01-2011 11:44 AM

What are the alternate methods that use all data, disregarding the non-PF flag?

skruglyak 10-03-2011 02:25 PM

Quote:

Originally Posted by msincan (Post 52783)
What are the alternate methods that use all data, disregarding the non-PF flag?

A common aligner such as BWA will not recognize the filter flag in our FASTQ file. As a result, the BAM bitwise flag that reflects "not passing quality controls" will not be set. Any variant caller (samtools or GATK) will end up using all of the data.

Thanks,

Semyon
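
For anyone who keeps non-PF reads in the FASTQ files (for example with --with-failed-reads) and aligns with a tool that ignores the flag, one option is to drop flagged reads before alignment. The sketch below assumes the CASAVA 1.8 FASTQ header layout, in which the text after the space is read:filtered:control:index and the filtered field is Y for a non-PF read and N otherwise. It also assumes plain four-line FASTQ records. The function names are hypothetical.

```python
def is_filtered(header):
    """True if a CASAVA 1.8-style header marks the read as filtered (non-PF)."""
    parts = header.rstrip("\n").split(" ", 1)
    if len(parts) < 2:
        raise ValueError(f"no comment field in FASTQ header: {header!r}")
    fields = parts[1].split(":")
    if len(fields) < 2 or fields[1] not in ("Y", "N"):
        raise ValueError(f"unexpected filter field in header: {header!r}")
    return fields[1] == "Y"


def pf_records(lines):
    """Yield four-line FASTQ records whose header filter flag is N."""
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            if not is_filtered(record[0]):
                yield tuple(record)
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


# Two made-up records: the first is flagged as filtered, the second is not.
example = [
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG\n",
    "ACGT\n", "+\n", "IIII\n",
    "@EAS139:136:FC706VJ:2:2104:15343:197394 1:N:18:ATCACG\n",
    "TTTT\n", "+\n", "IIII\n",
]
kept = list(pf_records(example))
```

Here `kept` holds only the second record. For paired-end data the same filter would need to be applied to both read files so that mates stay in sync.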

maubp 10-17-2011 09:02 AM

Hi Semyon

Could you clarify if the unaligned reads in SAM/BAM format from Illumina CASAVA 1.8.x have FLAG bit 0x200 (not passing quality controls) set according to your non-PF QC?

Thanks,

Peter

mcrusch 10-17-2011 09:39 AM

Thank you for this change!

Is this patch released yet, or is there a more specific ETA?

skruglyak 10-17-2011 10:02 AM

Quote:

Originally Posted by maubp (Post 54112)
Hi Semyon

Could you clarify if the unaligned reads in SAM/BAM format from Illumina CASAVA 1.8.x have FLAG bit 0x200 (not passing quality controls) set according to your non-PF QC?

Thanks,

Peter

Hi Peter,

There is a distinction between unaligned reads and non-PF reads.
In CASAVA 1.8, all non-PF reads in the BAM output have the "not passing quality controls" flag bit set (0x200). Note that this setting is independent of alignment: unaligned reads are indicated with the conventional "segment unmapped" flag bit (0x004). Starting in 1.8.2, the default behavior will be to exclude non-PF reads entirely, as explained earlier in this thread.

Thanks,
Semyon
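
The independence of the two flag bits mentioned above can be illustrated with a small Python sketch. The bit values 0x4 and 0x200 are the ones named in the post and defined in the SAM specification; the constant and function names are made up for the example.

```python
FLAG_UNMAPPED = 0x4    # "segment unmapped"
FLAG_QC_FAIL = 0x200   # "not passing quality controls"


def describe_flag(flag):
    """Report the two bits independently for a SAM/BAM FLAG value."""
    return {
        "unmapped": bool(flag & FLAG_UNMAPPED),
        "qc_fail": bool(flag & FLAG_QC_FAIL),
    }


# A read can be aligned and non-PF, unaligned and PF, both, or neither.
aligned_non_pf = describe_flag(0x200)
unaligned_pf = describe_flag(0x4)
unaligned_non_pf = describe_flag(0x204)
```

On an existing BAM file, a filter such as `samtools view -F 0x200` excludes records with the QC-fail bit set, which is one way to drop non-PF reads before variant calling.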

maubp 10-17-2011 10:10 AM

Thanks.

Apologies if I was unclear - I was trying to distinguish raw reads in SAM/BAM (all reads unaligned) from a finished assembly/mapping in SAM/BAM (where most of the reads are aligned).

skruglyak 10-17-2011 10:31 AM

Quote:

Originally Posted by maubp (Post 54123)
Thanks.

Apologies if I was unclear - I was trying to distinguish raw reads in SAM/BAM (all reads unaligned) from a finished assembly/mapping in SAM/BAM (where most of the reads are aligned).


Sorry if I still misunderstand... CASAVA produces FASTQ files (not unaligned BAMs). The only BAM files produced are post alignment.

maubp 10-17-2011 10:42 AM

Oh. I was under the (wrong?) impression that Illumina was looking at producing unaligned SAM/BAM as an output alternative. This idea is attractive because it has explicit standards for things like QC flags, and other things like read pairings - rather than the current pain where the precise encoding of this meta information into the FASTQ free text seems to change far too often.

Perhaps I'd misheard the news that Illumina was doing post-alignment output as SAM/BAM.

Update: See this blog post and this thread for more about unaligned SAM/BAM as an alternative to FASTQ.

skruglyak 10-19-2011 02:54 PM

Quote:

Originally Posted by mcrusch (Post 54117)
Thank you for this change!

Is this patch released yet, or is there a more specific ETA?


1.8.2 is available starting today.

Thanks,

Semyon

kmcarr 10-20-2011 07:27 AM

Quote:

Originally Posted by skruglyak (Post 54435)
1.8.2 is available starting today.

Thanks,

Semyon

Semyon,

Thanks for letting us know.

I do have one nit to pick, however. Starting with CASAVA 1.8, the download tarball contains a massive validation data set of 1.5GB (>90% of the uncompressed size). Would it be possible to separate the code from the sample data for the folks who don't want to spend a couple of hours downloading the software? It's particularly irksome since I've now downloaded exactly the same data set 3 times (with 1.8.0, 1.8.1 and now 1.8.2).

Thanks.

simonandrews 10-20-2011 07:43 AM

Quote:

Originally Posted by kmcarr (Post 54483)
Would it be possible to separate the code from the sample data

Yes - if you could change this it would be great! Very few people will bother running the validation data, and those who want to will be happy to download it separately. We got awful transfer rates from illumina.com (presumably pulling data over a transatlantic link), and downloading the last CASAVA update took the best part of a day.

kmcarr 10-24-2011 06:39 AM

Quote:

Originally Posted by skruglyak (Post 52126)
We are planning a minor release of CASAVA in October that is primarily intended to handle an improvement to the number of supported index sequences. In the same release, we plan to change the default behavior and omit reads that do not pass filter (non-PF reads) from the FASTQ files. In general, we do not recommend the use of non-PF reads. Users who want to retain the non-PF reads will be able to do so by adding the following parameter to configureBclToFastq.pl:

--with-failed-reads

A read is classified as non-PF when more than one cycle in the first 25 cycles has a poor ratio (<0.6) of the brightest intensity to the sum of the brightest and second brightest.
Our variant calling software ignores non-PF reads, but there are many alternate methods that use all data, disregarding the non-PF flag. The inclusion of non-PF reads increases time to align, increases the data footprint, increases the measured error rate, and can lead to variant calling errors. As a result, we have decided to exclude such reads by default. Because they are excluded from the FASTQ files, the reads will also be excluded from all downstream processing and output, including both archival and standard BAM files.

Please let me know if you have questions or concerns.

Thank you,
Semyon

Semyon,

I really appreciate that Illumina has been so responsive to customer feedback with regard to refinement of the CASAVA pipeline and I really hate to keep coming up with more things to tweak/change, but...

I just ran my first data set through the new 1.8.2 pipeline and truly appreciate the PF-only default and --fastq-cluster-count 0 options. However, I noticed what I consider a bug in some of the summary files produced by CASAVA. Some summary files (e.g. Flowcell_demux_summary.xml) report the number of PF clusters/bases for both the raw and PF counts. Other files (e.g. BustardSummary.xml) appear to correctly report raw and PF.

Thanks again.

selen 11-14-2011 08:44 AM

A single bam file as alignment output
 
Dear Semyon,

Is there a way to get alignments in a single file per sample in bam format as alignment output?

As far as I know we need an additional "configurebuild --targets sort bam " step to achieve it right now.

Thanks

skruglyak 11-14-2011 10:41 AM

Quote:

Originally Posted by kmcarr (Post 54739)
Semyon,

I really appreciate that Illumina has been so responsive to customer feedback with regard to refinement of the CASAVA pipeline and I really hate to keep coming up with more things to tweak/change, but...

I just ran my first data set through the new 1.8.2 pipeline and truly appreciate the PF-only default and --fastq-cluster-count 0 options. However, I noticed what I consider a bug in some of the summary files produced by CASAVA. Some summary files (e.g. Flowcell_demux_summary.xml) report the number of PF clusters/bases for both the raw and PF counts. Other files (e.g. BustardSummary.xml) appear to correctly report raw and PF.

Thanks again.

Sorry for the late reply. I somehow missed the notification of your post. You are correct. The stats are computed after the FASTQ file is made, so non-PF reads have already been dropped by that point, which leads to the issue that you observe. Have you tried using SAV (Sequencing Analysis Viewer)? It reports a lot of valuable statistics created by RTA, including %PF.

Thanks for your feedback.

Semyon

skruglyak 11-14-2011 10:45 AM

Quote:

Originally Posted by selen (Post 56684)
Dear Semyon,

Is there a way to get alignments in a single file per sample in bam format as alignment output?

As far as I know we need an additional "configurebuild --targets sort bam " step to achieve it right now.

Thanks

Hi selen,

You are correct to use configureBuild to generate the single BAM file. I spoke with a member of my team and he provided the following example.

Thanks,
Semyon

$CASAVA_PATH/bin/configureBuild.pl \
--outDir ./outdir \
--inSampleDir /path/to/eland_alignment/Sample_exampleSample \
--samtoolsRefFile genome.fa \
--targets sort bam \
--sortKeepAllReads

kmcarr 11-15-2011 11:06 AM

Quote:

Originally Posted by skruglyak (Post 56687)
Sorry for the late reply. I somehow missed the notification of your post. You are correct. The stats are computed after the FASTQ file is made, so non-PF reads have already been dropped by that point, which leads to the issue that you observe. Have you tried using SAV (Sequencing Analysis Viewer)? It reports a lot of valuable statistics created by RTA, including %PF.

Thanks for your feedback.

Semyon

Semyon,

Yes, it's true SAV presents some of that data, but I need the data in a format that I can parse to generate reports. This means the .xml files produced by CASAVA. The files produced by CASAVA really should properly report the number of Raw and PF clusters generated regardless of what is output to the FASTQ files.


All times are GMT -8.