SEQanswers

Old 09-08-2011, 09:00 AM   #21
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by GenoMax View Post
Have you put a new version up that I can try that accounts for the total number of reads going in?
I've not put up a new snapshot but I've changed our development version. The only difference is that there's an extra row in the summary statistics module which says how many sequences were filtered. All of the other stats will be exactly the same as for the version you tested.

An official release should be along soon....
Old 09-09-2011, 11:17 AM   #22
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,147
Default

Simon,

Your grouping of fastq files with different segment numbers is quite welcome but it got me thinking about how this feature might be extended. (Don't you just love users who are never satisfied.) More specifically I was thinking it would be very useful to be able to group files based on different criteria such as all files for one sample if run over multiple lanes or all samples in one lane. The new naming convention in CASAVA 1.8+ is:

Code:
<SampleName>_<Barcode>_L00<lane#>_R<read#>_<segment#>.fastq.gz
Apparently your new feature matches every part of the name except the segment#. So presumably it wouldn't be too difficult to have options to match on SampleName,Barcode,read# only or lane#,read# only.
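A rough sketch of what such grouping could look like, assuming file names follow the convention above exactly. The regular expression, the `group_files` helper and the example file names are all hypothetical and are not taken from FastQC itself:

```python
import re
from collections import defaultdict

# Pattern for the CASAVA 1.8+ naming convention quoted above:
# <SampleName>_<Barcode>_L00<lane#>_R<read#>_<segment#>.fastq.gz
# Assumption: the lane is a single digit and the barcode has no underscore.
CASAVA_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L00(?P<lane>\d)"
    r"_R(?P<read>\d)_(?P<segment>\d+)\.fastq\.gz$"
)

def group_files(filenames, keys):
    """Group file names by the chosen name fields; unmatched names are skipped."""
    groups = defaultdict(list)
    for name in filenames:
        match = CASAVA_NAME.match(name)
        if match is None:
            continue
        groups[tuple(match.group(k) for k in keys)].append(name)
    # Sort each group so segments are read in a stable order.
    return {key: sorted(names) for key, names in groups.items()}

# Made-up file names, for illustration only.
files = [
    "SampleA_ACAGTG_L001_R1_001.fastq.gz",
    "SampleA_ACAGTG_L001_R1_002.fastq.gz",
    "SampleA_ACAGTG_L002_R1_001.fastq.gz",
    "SampleB_GCCAAT_L001_R1_001.fastq.gz",
]

# One sample across all lanes: match on SampleName, Barcode, read#.
by_sample = group_files(files, ("sample", "barcode", "read"))
# Everything in one lane: match on lane#, read#.
by_lane = group_files(files, ("lane", "read"))
```

The only thing that changes between the two groupings is the tuple of fields used as the key, which is why it seems like a small extension of matching on everything except the segment#.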

Maybe that could go on the list of possible features for a future release.
Old 09-09-2011, 11:26 AM   #23
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766
Default

I vote to request that feature as well.

Quote:
Originally Posted by kmcarr View Post
Simon,

Your grouping of fastq files with different segment numbers is quite welcome but it got me thinking about how this feature might be extended. (Don't you just love users who are never satisfied.) More specifically I was thinking it would be very useful to be able to group files based on different criteria such as all files for one sample if run over multiple lanes or all samples in one lane. The new naming convention in CASAVA 1.8+ is:

Code:
<SampleName>_<Barcode>_L00<lane#>_R<read#>_<segment#>.fastq.gz
Apparently your new feature matches every part of the name except the segment#. So presumably it wouldn't be too difficult to have options to match on SampleName,Barcode,read# only or lane#,read# only.

Maybe that could go on the list of possible features for a future release.
Old 09-16-2011, 05:04 AM   #24
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

If it's any help to anyone I've written up the problem we'd found when moving over to using Casava 1.8 on our pipeline, along with the workarounds we're now using.

I'll have a think about the best way to flexibly group samples together in FastQC reports.
Old 09-16-2011, 05:31 AM   #25
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766
Default

Simon,

To be fair they have split the alignment part into a separate step. So one can stop right before that step.

We are basically going to do exactly the same things you outlined in your blog post. I am not sure about your facility but we do use "ELAND" for diagnostic mapping on the control lane (when there are samples known to have strange nucleotide distribution present). So we may end up running the alignment steps for those flowcells.



Quote:
Originally Posted by simonandrews View Post
If it's any help to anyone I've written up the problem we'd found when moving over to using Casava 1.8 on our pipeline, along with the workarounds we're now using.

Last edited by GenoMax; 09-16-2011 at 05:39 AM.
Old 09-16-2011, 05:39 AM   #26
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,147
Default

Quote:
Originally Posted by simonandrews View Post
If it's any help to anyone I've written up the problem we'd found when moving over to using Casava 1.8 on our pipeline, along with the workarounds we're now using.

I'll have a think about the best way to flexibly group samples together in FastQC reports.
Great write up Simon, hopefully the folks at Illumina will take notes.

I had thought some more about my request for alternative grouping of samples for FastQC and I realized there might be a problem when regrouping all the demultiplexed samples from a lane. If I understand the way certain modules of FastQC work (e.g. overrepresented sequences) the first 200K reads are used as a reference set which the remaining reads are compared to. Inherent in this is the assumption that reads would be randomly ordered in the file. If the reads are demultiplexed and then grouped back together this would no longer work since the ordering of reads is no longer random. This would now require additional computational gymnastics to create a representative test set for the lane.

Grouping files for the same sample from multiple lanes should be straightforward though since it could be safely assumed that reads for a single sample, even if run over several lanes, are randomly ordered within the set of files.
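To make the ordering concern concrete, here is a toy sketch of the "only track what shows up early" idea. It is not FastQC's actual code: `count_tracked`, the cutoff of 200 reads (standing in for the 200K figure mentioned above) and the made-up reads are all assumptions for illustration.

```python
from collections import Counter

def unique_read(i, length=8):
    """Encode an integer as a DNA string so each toy read is distinct."""
    bases = "ACGT"
    return "".join(bases[(i >> (2 * k)) & 3] for k in range(length))

def count_tracked(reads, tracked_limit):
    """Count only sequences first seen within the first `tracked_limit` reads."""
    counts = Counter()
    for index, seq in enumerate(reads):
        if seq in counts:
            counts[seq] += 1
        elif index < tracked_limit:
            counts[seq] = 1
        # Sequences first seen after the cutoff are never counted.
    return counts

CONTAMINANT = "GATCGGAAGAGC"  # arbitrary placeholder sequence

# Sample A is clean; sample B is dominated by one contaminating sequence.
sample_a = [unique_read(i) for i in range(1000)]
sample_b = [CONTAMINANT] * 900 + [unique_read(i + 1000) for i in range(100)]

# Demultiplexed files glued back together: all of A, then all of B.
concatenated = sample_a + sample_b
# Stand-in for the original lane order, where the samples are mixed together.
interleaved = [read for pair in zip(sample_a, sample_b) for read in pair]

print(count_tracked(concatenated, 200)[CONTAMINANT])
print(count_tracked(interleaved, 200)[CONTAMINANT])
```

In the concatenated case the contaminant first appears after the cutoff and is never tracked, whereas in the interleaved case it is picked up early and counted all the way through.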
Old 09-16-2011, 05:42 AM   #27
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by GenoMax View Post
To be fair they have split the alignment part into a separate step. So unless you are using ELAND alignments for something you could just omit alignments altogether.
This isn't really any different to how it was before. If you didn't want alignments you just ran with ANALYSIS sequence(_pair). It's just that now you have to replace that with a call to whichever program you're using to filter and combine your fastq files instead of doing it through Gerald.

Quote:
Originally Posted by GenoMax View Post
If you want to simplify things then you can standardize on providing a SampleSheet.csv file (irrespective of whether or not you have multiplex samples) and let the pipeline create the default "Unaligned/Project_FlowCell_ID/Sample_lane(x)" folder hierarchy. Since you are using a LIMS it should be simple to come up with an appropriate SampleSheet.csv file automatically.
For our site that's fine - and that's what we're doing. We can agree on what sample sheet to use. The thing which makes it more tricky is that we distribute a LIMS which gets used on other people's sites. This means we'd either need to get them to use our default sample sheet or we need to hunt much harder to find the files we can associate with each lane. It would have been really easy to have the new system be allowed to run without a sample sheet and just use lane numbers instead of samples rather than forcing you to specify information you may not have.
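For anyone generating the sheet from a LIMS, something along these lines might be a starting point. Treat it as a sketch under assumptions: the column names are as I recall them from the CASAVA 1.8 documentation and should be checked against the user guide for your version, and `build_sample_sheet` and the example values are hypothetical.

```python
import csv
import io

# Assumed column order for a CASAVA 1.8 SampleSheet.csv; verify before use.
COLUMNS = ["FCID", "Lane", "SampleID", "SampleRef", "Index",
           "Description", "Control", "Recipe", "Operator", "SampleProject"]

def build_sample_sheet(flowcell_id, lanes):
    """Return SampleSheet.csv text with one row per lane/sample record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for lane in lanes:
        # Start from empty fields, then fill in what the LIMS knows.
        row = dict.fromkeys(COLUMNS, "")
        row["FCID"] = flowcell_id
        row["Control"] = "N"
        row.update(lane)
        writer.writerow(row)
    return buffer.getvalue()

# Hypothetical LIMS records for a non-multiplexed flowcell.
records = [
    {"Lane": 1, "SampleID": "lane1", "SampleProject": "FC1"},
    {"Lane": 2, "SampleID": "lane2", "SampleProject": "FC1"},
]
print(build_sample_sheet("FC1", records))
```

Passing a record with a key that is not in `COLUMNS` raises a `ValueError` from `csv.DictWriter`, which is a cheap way to catch a mismatch between the LIMS export and the sheet layout.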

Quote:
Originally Posted by GenoMax View Post
If you do need to use ELAND for alignment then things get more complicated as you have outlined in your blog post. You would be forced to use the split files (the --fastq-cluster-count set to a large number would not work unless you have a node with gobs of RAM that can deal with a 50+GB fastq file) and deal with the results with additional steps past the pipeline analysis.
According to the docs Eland won't process a fastq file with more than 16 million reads in it, so even great gobs of memory won't help. Whether this is actually a limit in practice, or just the limit of the supported configurations, I haven't tested (and don't intend to!).

The fastq cluster count option on its own isn't of great use to us since it still leaves the problem of reads which failed the purity filter being left in the output, so we're still going to need to process the output, even if it's all in one file.
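As a minimal sketch of that kind of post-processing, the following merges gzipped segments and drops the flagged reads. It assumes CASAVA 1.8 style headers in which the second field after the space is the filter flag (`Y` meaning the read was filtered, `N` meaning it passed), and plain 4-line FASTQ records; `is_filtered` and `merge_and_filter` are hypothetical names, not part of any Illumina tool.

```python
import gzip

def is_filtered(header):
    """True if a CASAVA 1.8 style header marks the read as filtered (failed)."""
    # Assumed shape: "@<id fields> <read>:<Y|N>:<control>:<index>"
    parts = header.rstrip().split(" ", 1)
    if len(parts) < 2:
        return False  # no flag present, so keep the read
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"

def merge_and_filter(segment_paths, output_path):
    """Concatenate gzipped FASTQ segments, dropping reads flagged as filtered."""
    kept = dropped = 0
    with gzip.open(output_path, "wt") as out:
        for path in segment_paths:
            with gzip.open(path, "rt") as handle:
                while True:
                    # Assumes each record is exactly four lines long.
                    record = [handle.readline() for _ in range(4)]
                    if not record[0]:
                        break
                    if is_filtered(record[0]):
                        dropped += 1
                    else:
                        out.writelines(record)
                        kept += 1
    return kept, dropped
```

Called with the sorted list of segment files for one lane and read number, it returns the number of reads kept and dropped, which is handy for reporting how many sequences were filtered.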
Old 09-16-2011, 05:50 AM   #28
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by kmcarr View Post
I had thought some more about my request for alternative grouping of samples for FastQC and I realized there might be a problem when regrouping all the demultiplexed samples from a lane. If I understand the way certain modules of FastQC work (e.g. overrepresented sequences) the first 200K reads are used as a reference set which the remaining reads are compared to. Inherent in this is the assumption that reads would be randomly ordered in the file. If the reads are demultiplexed and then grouped back together this would no longer work since the ordering of reads is no longer random. This would now require additional computational gymnastics to create a representative test set for the lane.
For mixtures of multiplexed samples the overrepresented and Kmer modules aren't going to make any sense anyway since any problems will be at the level of the individual library rather than the lane.

We're actually seeing similar problems already when people pass in sorted BAM files to the program, which obviously provide a very distorted order of sequences and we can easily miss things which happen on later chromosomes.

Unfortunately I don't think there's any way around this without either doing multiple passes through the file, or potentially storing every read in the file in memory - neither of which is a good solution.
Old 09-16-2011, 05:55 AM   #29
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766
Default

You can omit "SampleSheet.csv" file altogether and the directory hierarchy "Unaligned/Project_Flow_cell_ID/Sample_lane(x)" is still automatically created. The flowcell ID is parsed from the folder name.

The problem I see is if you do not consistently provide a "SampleSheet.csv" file then you would have two paths to worry about (one for multiplexed samples and one for not).

Quote:
Originally Posted by simonandrews View Post
It would have been really easy to have the new system be allowed to run without a sample sheet and just use lane numbers instead of samples rather than forcing you to specify information you may not have.
Old 09-16-2011, 06:06 AM   #30
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Quote:
Originally Posted by GenoMax View Post
You can omit "SampleSheet.csv" file altogether and the directory hierarchy "Unaligned/Project_Flow_cell_ID/Sample_lane(x)" is still automatically created. The flowcell ID is parsed from the folder name.
You're right! I foolishly believed the documentation which starts with the statement:

Quote:
"Demultiplexing needs a BaseCalls directory and a sample sheet to start a run".
It even says that if you don't provide a sample sheet it tries to read one from <input_dir/SampleSheet.csv>, with no mention that you can go without one altogether!

I'd tried using a blank sample sheet (with no sample or project names) and that failed, but you can indeed not specify a sample sheet at all.
Old 09-22-2011, 02:07 AM   #31
ptran
Junior Member
 
Location: Houston, TX

Join Date: Jun 2011
Posts: 3
Default

Quote:
Originally Posted by simonandrews View Post
You're right! I foolishly believed the documentation which starts with the statement:

It even says that if you don't provide a sample sheet it tries to read one from <input_dir/SampleSheet.csv>, with no mention that you can go without one altogether!

I'd tried using a blank sample sheet (with no sample or project names) and that failed, but you can indeed not specify a sample sheet at all.

The v1.8 User guide gives the following scenarios

Bcl Conversion/Demultiplexing Examples
Bcl conversion and demultiplexing support four scenarios:

1) Multiplexed samples present, with sample sheet.
Reads are placed within the directory structure specified by the sample sheet, based on the index and lane information. Reads for which the index sequence was ambiguous will be placed in a project directory called Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane.

2) Multiplexed and non-multiplexed samples present, with sample sheet.
Reads are placed within the directory structure specified by the sample sheet, based on the index and lane information. Reads containing ambiguous or no barcodes will be placed in a project directory called Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane.

3) No multiplexed samples present, with sample sheet.
Reads are placed within the directory structure directed by the sample sheet, based on the lane information.

4) No multiplexed samples present, without sample sheet.
Reads are placed in a project directory named after the flow cell, and sample directories based on the lane number.
Old 09-23-2011, 12:37 PM   #32
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 44
Default

Quote:
Originally Posted by simonandrews View Post
I'm looking into this as well, and it seems that there are a few annoyances with the new version of CASAVA.

I suspect the reason for removing ANALYSIS sequence option is that it's not supposed to be needed since the default output format from base calling is now fastq, so you'd just not run alignment for that lane. However this leads to some headaches:
  1. On some (possibly all?) platforms the output for a single lane is split into multiple fastq files which means you've got to merge these back together for most downstream analysis outside of CASAVA
  2. All sequences are present in the fastq files, even ones which have failed QC. There is a flag in the header which says which sequences should be filtered, but no downstream applications understand this so you'll have to manually remove these entries before doing any further analysis.

Whilst Illumina have tried to move towards more standardised file formats (using Sanger offset fastq files and BAM for alignments), the new version of CASAVA is actually going to make our lives harder since we'll need to introduce a manual step to merge and filter the fastq files to get back to the single output file of good quality we had before.

If I'm reading the manual wrong with this (we haven't actually done a run under 1.8 yet), then I'd be very happy to hear it. How are other people coping with the changes in the new version?
Thanks for this feedback. Your concerns will be addressed in the minor October release. The default behavior will be to omit the reads that do not pass QC. For more detail, please see this post:
http://seqanswers.com/forums/showthr...2126#post52126

Additionally, you will have the option to produce 1 FASTQ file per sample.

thanks,

Semyon
Old 09-26-2011, 04:24 AM   #33
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Thanks for implementing this change. It's encouraging to see a company respond to user feedback.
Old 10-03-2011, 09:08 AM   #34
tonymell
Junior Member
 
Location: CT

Join Date: Nov 2010
Posts: 1
Default

Using CASAVA 1.8.1. According to Illumina tech support, you can set --fastq-cluster-count up to 999,999,999.
Tags
casava 1.8
