SEQanswers

Old 01-14-2011, 04:47 PM   #1
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 30
Default Upcoming changes in CASAVA

Hi everyone,

my name is Semyon and I work in Bioinformatics at Illumina. Our team has prepared a document describing the major changes planned in CASAVA 1.8. The document is available at iCom and attached to this post. I will do my best to follow the thread and answer any questions that you may have. Early access of the release is planned for late February.

The key changes are:

1. The bcl converter will be distributed with CASAVA.
2. The converter will produce compressed FASTQ files rather than qseq files.
3. The FASTQ quality score encoding will use the standard offset value of 33 rather than the previous Illumina-specific offset value of 64.
4. If samples have been multiplexed in a sequencing run using indexing, the converter will also perform demultiplexing.
5. The output files will be in a directory structure organized by project and sample rather than lane and tile.
6. The GERALD summary file will be modified in accordance with the new directory structure.
7. The sequence output of post-alignment analysis will be a set of BAM files.

Thanks!
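For anyone holding older offset-64 FASTQ files, change 3 amounts to a fixed shift of every quality character. The following is a minimal sketch, not part of CASAVA; the function name is made up for illustration, and it assumes the input really is Phred-scaled with offset 64 (older Solexa-scaled files, which can contain characters below `@`, are rejected rather than converted).

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a FASTQ quality string from offset 64 to offset 33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # Phred score under the older Illumina offset
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid offset-64 quality character")
        converted.append(chr(score + 33))  # same score under the standard offset
    return "".join(converted)


# 'h' is Phred 40 under offset 64; Phred 40 under offset 33 is 'I'.
print(phred64_to_phred33("hhhh"))  # IIII
```

Only the quality line of each record changes; the sequence and header lines are untouched by the offset switch.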
Attached Files
File Type: pdf CASAVA1_8_Changes.pdf (438.1 KB, 882 views)
Old 01-14-2011, 11:45 PM   #2
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 256
Default

Hooray for changes 1, 2, 3 and 7 :-)
d
Old 01-15-2011, 04:27 AM   #3
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,410
Default

Quote:
Originally Posted by dawe View Post
Hooray for changes 1, 2, 3 and 7 :-)
d
+1

My only concern is that the read names in the FASTQ files will not include the /1 or /2 suffix. This means both the forward and reverse reads get the same identifier, with the number (1 or 2) in the read description (i.e. in the @ line but after whitespace). There are nice symmetries with the SAM/BAM format. However, this will mean any existing scripts/tools/pipelines expecting the suffixes will need changing.
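A script that has to cope with both naming styles could normalise the header before doing anything else. This is only a sketch: the function name is hypothetical, and it assumes that in the new style the read number is the first colon-separated field of the description, which is not spelled out in this thread.

```python
from typing import Optional, Tuple


def split_read_name(header: str) -> Tuple[str, Optional[str]]:
    """Return (read identifier, read number) from a FASTQ '@' line."""
    text = header[1:] if header.startswith("@") else header
    parts = text.split(None, 1)  # identifier, then optional description
    if not parts:
        raise ValueError("empty FASTQ header")
    name = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    # Older style: read number appended to the identifier as /1 or /2.
    if name.endswith(("/1", "/2")):
        return name[:-2], name[-1]
    # Newer style (assumed layout): read number leads the description.
    first_field = description.split(":", 1)[0].strip()
    return name, first_field if first_field in ("1", "2") else None
```

With this, `split_read_name("@r1/2")` and `split_read_name("@r1 2:N:0:ATCACG")` both report identifier `r1` and read number `2`, so downstream pairing logic does not need to know which converter produced the file. The header strings here are invented examples.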
Old 01-15-2011, 09:59 AM   #4
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,283
Default

Quote:
Originally Posted by maubp View Post
+1

My only concern is that the read names in the FASTQ files will not include the /1 or /2 suffix. This means both the forward and reverse reads get the same identifier, with the number (1 or 2) in the read description (i.e. in the @ line but after whitespace). There are nice symmetries with the SAM/BAM format. However, this will mean any existing scripts/tools/pipelines expecting the suffixes will need changing.
On the other hand, those tools that expect the same identifier but no suffix will not need changing.
Old 01-15-2011, 10:49 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 990
Default

I have a concern about #2. Currently the illumina2srf tool uses the qseq files as input to generate the .srf files which are required for submission of NGS sequencing data to the NCBI or EBI SRAs. Will it still be possible to generate qseqs or would it be possible for the CASAVA team to work with the developers of the sequenceread toolkit to allow it to work directly from the .bcl files?
Old 01-15-2011, 10:52 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,410
Default

Quote:
Originally Posted by nilshomer View Post
On the other hand, those tools that expect the same identifier but no suffix will not need changing.
Fair point

Then there are also probably tools which don't even look at the read names - from memory Velvet just takes an interleaved file (forward then reverse, typically made by merging a separate pair of files), without worrying about how they are named.

The naming issue isn't critical (but it is something to be aware of)
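For reference, merging a separate pair of files into an interleaved one never has to look at the read names. The sketch below assumes plain 4-line FASTQ records (no wrapped sequence lines) and that both files list the reads in the same order; the function names are illustrative and this is not Velvet's own merge script.

```python
import gzip
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed FASTQ file for reading as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def records(handle):
    """Yield FASTQ records as lists of four newline-terminated lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield [line if line.endswith("\n") else line + "\n" for line in record]


def interleave(forward_path, reverse_path, out_path):
    """Write forward and reverse records alternately; return the pair count."""
    pairs = 0
    with open_text(forward_path) as fwd, open_text(reverse_path) as rev, \
            open(out_path, "w") as out:
        rev_records = records(rev)
        for fwd_record in records(fwd):
            rev_record = next(rev_records, None)
            if rev_record is None:
                raise ValueError("reverse file has fewer records than forward file")
            out.writelines(fwd_record)
            out.writelines(rev_record)
            pairs += 1
        if next(rev_records, None) is not None:
            raise ValueError("reverse file has more records than forward file")
    return pairs
```

Because pairing is purely positional, the output is only meaningful if neither input file has been filtered or re-sorted independently of the other.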
Old 01-15-2011, 11:40 AM   #7
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 256
Default

Quote:
Originally Posted by kmcarr View Post
I have a concern about #2. Currently the illumina2srf tool uses the qseq files as input to generate the .srf files which are required for submission of NGS sequencing data to the NCBI or EBI SRAs. Will it still be possible to generate qseqs or would it be possible for the CASAVA team to work with the developers of the sequenceread toolkit to allow it to work directly from the .bcl files?
I believe gzipped fastq files can also be uploaded...

D
Old 01-15-2011, 02:27 PM   #8
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 990
Default

Quote:
Originally Posted by dawe View Post
I believe gzipped fastq files can also be uploaded...

D
For the time being. If you look at the NCBI SRA File Format Guide you will see this statement (in bold) under Section 2.11

Quote:
Because of the many benefits of container formats, INSDC SRAs intend to cease support for native and fastq forms by 2011.
They are really pushing submitters to use the containerized binary formats like SRF and SRA.
Old 01-17-2011, 08:18 AM   #9
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 843
Default

From my point of view there are good and bad things in the list:

1,2,3 All good!

4 Not really an issue for us so no feeling either way

5,6 Probably bad (for us at least). Having a folder hierarchy which you can only predict by looking up the sample sheet doesn't make my life any easier. I realise that there are problems with just using technical names for results, but I can see this causing more grief. I'd be interested to see what sort of names you get if you use a blank sample sheet, as I guess that's what we'd do if we want to manage samples and projects from outside the Illumina software.

I take it that in the example posted the run folder at the top of the tree would equate to the current Gerald folder so that it would still be simple to do multiple analysis runs of the same data and get easily separated output?

Also in the example tree why are the BAM files under 'Build' and not 'Aligned'?

7 Good and Bad. I agree that the world seems to have settled on BAM as its file format of choice. The compact size will certainly be welcome, and if we can get SRA/ENA to accept BAM files as submissions then lots of people will be happier - but in the meantime there are a bunch of processing steps which were pretty easy with the old eland output and which will be much harder from a BAM file. Just writing a simple filter to extract some entries from a BAM file and write them out to a new one is really non-trivial if done from scratch, whereas it might just be a grep on an eland file.
Old 01-17-2011, 01:26 PM   #10
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default @

Qualitatively speaking, would the @ represent a good thing or a bad thing?
Old 01-17-2011, 01:47 PM   #11
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 30
Default

Quote:
Originally Posted by simonandrews View Post
From my point of view there are good and bad things in the list:

1,2,3 All good!

4 Not really an issue for us so no feeling either way

5,6 Probably bad (for us at least). Having a folder hierarchy which you can only predict by looking up the sample sheet doesn't make my life any easier. I realise that there are problems with just using technical names for results, but I can see this causing more grief. I'd be interested to see what sort of names you get if you use a blank sample sheet, as I guess that's what we'd do if we want to manage samples and projects from outside the Illumina software.

I take it that in the example posted the run folder at the top of the tree would equate to the current Gerald folder so that it would still be simple to do multiple analysis runs of the same data and get easily separated output?

Also in the example tree why are the BAM files under 'Build' and not 'Aligned'?

7 Good and Bad. I agree that the world seems to have settled on BAM as its file format of choice. The compact size will certainly be welcome, and if we can get SRA/ENA to accept BAM files as submissions then lots of people will be happier - but in the meantime there are a bunch of processing steps which were pretty easy with the old eland output and which will be much harder from a BAM file. Just writing a simple filter to extract some entries from a BAM file and write them out to a new one is really non-trivial if done from scratch, whereas it might just be a grep on an eland file.
Thank you very much for the feedback! Regarding 5 and 6, if there is no sample sheet, we will have simple default names for the project and the sample. I can send you more detail if you would like.

The motivation behind the change is that we are thinking about increased throughput that will lead to many samples on a single flow cell. The ability to organize the results by project and sample will hopefully be useful. Also, the demultiplexing output is well suited for such a structure.

Running repeated analysis on the same data and getting easily separated output folders will continue to be supported.

The BAM files are under BUILD because they are the result of the post alignment process (sorting is done) and because multiple alignment events (flow cells) can be combined into a single build of CASAVA. The ALIGNED folder will contain the zipped exports.

If you need to parse information out of the BAM file, it would seem that conversion to SAM would get you to the text file that you need.
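As a rough illustration of the SAM route, a grep-like filter can sit in a pipe between two samtools calls. This is a sketch under stated assumptions rather than anything shipped with CASAVA: the function name, the choice of filtering on reference name and mapping quality, and the MAPQ threshold of 20 in the command-line branch are all made up for the example.

```python
import sys


def filter_sam(lines, reference, min_mapq=0):
    """Yield header lines plus alignments on `reference` with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # Mandatory SAM columns: 0 QNAME, 1 FLAG, 2 RNAME, 3 POS, 4 MAPQ, ...
        if len(fields) >= 11 and fields[2] == reference and int(fields[4]) >= min_mapq:
            yield line


if __name__ == "__main__" and len(sys.argv) > 1:
    # Reference name comes from the command line; SAM text arrives on stdin.
    sys.stdout.writelines(filter_sam(sys.stdin, sys.argv[1], min_mapq=20))
```

Saved as, say, `filter_sam.py` (a hypothetical file name), it could be used as `samtools view -h in.bam | python filter_sam.py chr1 | samtools view -b -o out.bam -`. Older samtools releases also need `-S` on the second call to declare SAM input. The header is kept so that the result can be converted back to BAM.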

Thanks again,

Semyon
Old 01-19-2011, 01:18 PM   #12
selen
Junior Member
 
Location: Ohio

Join Date: Dec 2010
Posts: 9
Default

Semyon,

Can you please post more information about the directory structure and how you set it up by project/sample names?

Thanks
Selen
Old 01-19-2011, 02:57 PM   #13
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 30
Default

Quote:
Originally Posted by selen View Post
Semyon,

Can you please post more information about the directory structure and how you set it up by project/sample names?

Thanks
Selen

Hi Selen,

The most common use case will be through a sample sheet. Each row of the sample sheet would contain information including Lane, Index, Sample Name, and Project Name. The folder structure would then be created based on the names in the sample sheet. We recommend always having a sample sheet, but if one is not provided, we would use a set of simple default names. We would assume that there is just one project and that each lane contains a separate sample.
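To make the mapping concrete, here is a small sketch of how sample sheet rows could translate into a project/sample folder layout. Everything specific in it is an assumption for illustration: the CSV header spellings, the `project/sample` path form, the default project name and the `lane<N>` default sample names are hypothetical, since the actual sample sheet format and default names are not given in this thread.

```python
import csv
import io
import posixpath
from collections import defaultdict

# Hypothetical default; the thread only says the defaults will be "simple".
DEFAULT_PROJECT = "DefaultProject"


def layout_from_sample_sheet(csv_text):
    """Map project -> sample -> sorted lanes, from sample sheet CSV text."""
    layout = defaultdict(lambda: defaultdict(set))
    for row in csv.DictReader(io.StringIO(csv_text)):
        layout[row["Project Name"]][row["Sample Name"]].add(int(row["Lane"]))
    return {
        project: {sample: sorted(lanes) for sample, lanes in samples.items()}
        for project, samples in layout.items()
    }


def default_layout(lanes=range(1, 9)):
    """No sample sheet: one project, and each lane treated as its own sample."""
    return {DEFAULT_PROJECT: {f"lane{lane}": [lane] for lane in lanes}}


def planned_directories(layout):
    """List the project/sample directories implied by a layout."""
    return sorted(
        posixpath.join(project, sample)
        for project, samples in layout.items()
        for sample in samples
    )
```

For a sheet with two indexed samples in lane 1 under one project and a third sample in lane 2 under another, `planned_directories` returns three paths, one per sample, grouped by project. Lane and tile no longer appear in the path at all, which is the point of change 5.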

A diagram of the actual directory structure was in the appendix of the original attachment, so I am not sure what additional information you may want. Please let me know and I will do my best to answer.

Thanks,

Semyon
Old 01-20-2011, 06:58 AM   #14
selen
Junior Member
 
Location: Ohio

Join Date: Dec 2010
Posts: 9
Default

Quote:
Originally Posted by skruglyak View Post
Hi Selen,

The most common use case will be through a sample sheet. Each row of the sample sheet would contain information including Lane, Index, Sample Name, and Project Name. The folder structure would then be created based on the names in the sample sheet. We recommend always having a sample sheet, but if one is not provided, we would use a set of simple default names. We would assume that there is just one project and that each lane contains a separate sample.

A diagram of the actual directory structure was in the appendix of the original attachment, so I am not sure what additional information you may want. Please let me know and I will do my best to answer.

Thanks,

Semyon
Thanks Semyon, sample sheet is the key to my question. I assume you give this file as an input along with the config.txt while running bclconverter with the alignment option. Or is it something else?

The user guide for CASAVA1.8 hasn't been released yet, that would clarify most of my questions I bet.

Thanks a lot
Selen
Old 01-20-2011, 07:34 AM   #15
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 175
Default

Quote:
Originally Posted by kmcarr View Post
For the time being. If you look at the NCBI SRA File Format Guide you will see this statement (in bold) under Section 2.11



They are really pushing submitters to use the containerized binary formats like SRF and SRA.

Those SRF/SRA files are enormous! I guess they don't have storage concerns... but imagine transferring two HiSeq 2000 runs in SRF/SRA format.



I welcome the updates to CASAVA !

Now if only the OLB could be updated to allow one to train Bustard basecalling on a specific range of cycles and not only the 1st four...

Last edited by NGSfan; 01-20-2011 at 07:38 AM.
Old 01-20-2011, 08:50 AM   #16
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

Semyon

It's great to hear the news, but I'm very concerned about the speed of the bcl converter in CASAVA 1.8. How many hours will it take to get compressed FASTQ for a typical HiSeq 2000 run (with and without multiplexing)? And what about its parallelization support? Thanks.

Ying

Quote:
Originally Posted by skruglyak View Post
Hi everyone,

my name is Semyon and I work in Bioinformatics at Illumina. Our team has prepared a document describing the major changes planned in CASAVA 1.8. The document is available at iCom and attached to this post. I will do my best to follow the thread and answer any questions that you may have. Early access of the release is planned for late February.

The key changes are:

1. The bcl converter will be distributed with CASAVA.
2. The converter will produce compressed FASTQ files rather than qseq files.
3. The FASTQ quality score encoding will use the standard offset value of 33 rather than the previous Illumina-specific offset value of 64.
4. If samples have been multiplexed in a sequencing run using indexing, the converter will also perform demultiplexing.
5. The output files will be in a directory structure organized by project and sample rather than lane and tile.
6. The GERALD summary file will be modified in accordance with the new directory structure.
7. The sequence output of post-alignment analysis will be a set of BAM files.

Thanks!

Last edited by Auction; 01-20-2011 at 09:00 AM.
Old 01-20-2011, 09:59 AM   #17
skruglyak
Member
 
Location: San Diego

Join Date: Sep 2010
Posts: 30
Default

Quote:
Originally Posted by selen View Post
Thanks Semyon, sample sheet is the key to my question. I assume you give this file as an input along with the config.txt while running bclconverter with the alignment option. Or is it something else?

The user guide for CASAVA1.8 hasn't been released yet, that would clarify most of my questions I bet.

Thanks a lot
Selen

You are exactly right. The sample sheet will be part of the input to the converter. We are working on the user guide and it will certainly have all of the details on this.

Thanks!
Semyon
Old 01-20-2011, 11:18 AM   #18
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 990
Default

Quote:
Originally Posted by NGSfan View Post
Those SRF/SRA files are enormous! I guess they don't have storage concerns... but imagine transferring two HiSeq 2000 runs in SRF/SRA format.
For these large transfers NCBI forgoes protocols like FTP in favor of the FASP protocol from Aspera. FASP is UDP based and theoretical bandwidth is 1Gbps; they report seeing effective bandwidth of 600Mbps.
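To put those figures in perspective, the transfer time for a given submission size is simple arithmetic. The 500 GB size below is purely a placeholder for illustration, not a measured SRF size, and the function name is made up.

```python
def transfer_hours(size_gigabytes, megabits_per_second):
    """Hours needed to move `size_gigabytes` (decimal GB) at a sustained rate."""
    bits = size_gigabytes * 8e9  # 1 GB = 8e9 bits
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600


# Placeholder size at the reported effective rate and at the theoretical rate.
print(round(transfer_hours(500, 600), 2))   # 1.85
print(round(transfer_hours(500, 1000), 2))  # 1.11
```

This ignores protocol overhead and assumes the rate is sustained for the whole transfer, so real transfers would take somewhat longer.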
Old 01-20-2011, 11:30 AM   #19
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 990
Default

Quote:
Originally Posted by Auction View Post
Semyon

It's great to hear the news, but I'm very concerned about the speed of the bcl converter in CASAVA 1.8. How many hours will it take to get compressed FASTQ for a typical HiSeq 2000 run (with and without multiplexing)? And what about its parallelization support? Thanks.

Ying
I'm a little concerned about this too. My experience with CASAVA 1.7 and bclConverter is that the .bcl -> qseq step is very fast, but the qseq -> fastq step using GERALD (buildSeq.pl ?) is very, very slow.

Whenever possible I will not use GERALD to build fastq files. I use bclConverter to generate qseqs, then go from qseq -> srf (illumina2srf) and then srf -> fastq (srf2fastq). Even though this is a two-step process, it is still many times faster than using GERALD to build fastqs. Plus I have the bonus of the .srf file, which no doubt I will need 18 months down the line when the researcher wants to publish his results and needs to submit the data to SRA.
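For comparison, a direct qseq -> fastq conversion is a short per-line transformation. To be clear, this is a different technique from the illumina2srf/srf2fastq route described above, and it is only a sketch: it assumes the usual 11-column tab-separated qseq layout (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag), the read-name pattern is one illustrative choice, and the quality string is copied unchanged, so the output keeps the offset-64 encoding of the input.

```python
def qseq_to_fastq(line, keep_failed=False):
    """Convert one qseq line to a FASTQ record string, or None if filtered out."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read, seq, qual, passed = fields
    if passed != "1" and not keep_failed:
        return None  # read did not pass the instrument's quality filter
    # Illustrative read name carrying the old-style /1 or /2 suffix.
    name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read}"
    seq = seq.replace(".", "N")  # qseq writes uncalled bases as '.'
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Unlike the SRF route, this leaves nothing behind for archive submission, which is the bonus mentioned above.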
Old 01-20-2011, 11:51 AM   #20
selen
Junior Member
 
Location: Ohio

Join Date: Dec 2010
Posts: 9
Default

Quote:
Originally Posted by skruglyak View Post
You are exactly right. The sample sheet will be part of the input to the converter. We are working on the user guide and it will certainly have all of the details on this.

Thanks!
Semyon
Semyon,

Within the sample folder, the name of each fastq file provides the sample, index, lane and read information. What about the last three digits (001, 002, ...)? Do they represent repeated analyses of the same data?
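For scripts that need to pull those fields back out of a file name, a regular expression is probably enough. The pattern below is an assumption about the naming scheme (sample, index, `L` plus a three-digit lane, `R` plus the read number, then the three-digit number, with a `.fastq.gz` extension), and the example file name is invented; the meaning of the final number is deliberately left open here.

```python
import re

# Assumed naming scheme; adjust if the released user guide says otherwise.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<index>[^_]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<number>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Split a FASTQ file name into its fields, or return None if it doesn't match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("lane", "read", "number"):
        fields[key] = int(fields[key])
    return fields


print(parse_fastq_name("liver_ATCACG_L001_R1_002.fastq.gz"))
```

Sample names that themselves contain underscores still parse, because the index, lane, read and number fields are anchored from the right-hand end of the name.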
Tags
casava, illumina, secondary analysis

All times are GMT -8.