
  • #16
    Semyon

    It's great to hear the news, and I'm very concerned about the speed of the bcl converter in CASAVA 1.8. How many hours do we need to get compressed FASTQ for a typical HiSeq 2000 run (with and without multiplexing)? And how about its parallelization support? Thanks.

    Ying

    Originally posted by skruglyak View Post
    Hi everyone,

    my name is Semyon and I work in Bioinformatics at Illumina. Our team has prepared a document describing the major changes planned in CASAVA 1.8. The document is available at iCom and attached to this post. I will do my best to follow the thread and answer any questions that you may have. Early access of the release is planned for late February.

    The key changes are:

    1. The bcl converter will be distributed with CASAVA.
    2. The converter will produce compressed FASTQ files rather than qseq files.
    3. The FASTQ quality score encoding will use the standard offset value of 33 rather than the previous Illumina-specific offset value of 64.
    4. If samples have been multiplexed in a sequencing run using indexing, the converter will also perform demultiplexing.
    5. The output files will be in a directory structure organized by project and sample rather than lane and tile.
    6. The GERALD summary file will be modified in accordance with the new directory structure.
    7. The sequence output of post-alignment analysis will be a set of BAM files.

    Thanks!
    Last edited by Auction; 01-20-2011, 09:00 AM.

    Comment


    • #17
      Originally posted by selen View Post
      Thanks Semyon, sample sheet is the key to my question. I assume you give this file as an input along with the config.txt while running bclconverter with the alignment option. Or is it something else?

      The user guide for CASAVA1.8 hasn't been released yet, that would clarify most of my questions I bet.

      Thanks a lot
      Selen

      You are exactly right. The sample sheet will be part of the input to the converter. We are working on the user guide and it will certainly have all of the details on this.

      Thanks!
      Semyon

      Comment


      • #18
        Originally posted by NGSfan View Post
        those SRF/SRA files are enormous! I guess they don't have storage concerns... but imagine transferring two HiSeq2000 runs in SRF/SRA format
        For these large transfers NCBI forgoes protocols like FTP in favor of the FASP protocol from Aspera. FASP is UDP based and theoretical bandwidth is 1Gbps; they report seeing effective bandwidth of 600Mbps.

        Comment


        • #19
          Originally posted by Auction View Post
          Semyon

          It's great to hear the news, and I'm very concerned about the speed of the bcl converter in CASAVA 1.8. How many hours do we need to get compressed FASTQ for a typical HiSeq 2000 run (with and without multiplexing)? And how about its parallelization support? Thanks.

          Ying
          I'm a little concerned about this too. My experience/observation with CASAVA 1.7 and bclConverter is that the .bcl -> qseq step is very fast, but the qseq -> fastq step using GERALD (buildSeq.pl ?) is very, very slow.

          Whenever possible I will not use GERALD to build fastq files. I use bclConverter to generate qseqs, then go from qseq -> srf (illumina2srf) and then srf -> fastq (srf2fastq). Even though this is a two-step process it is still many times faster than using GERALD to build fastqs. Plus I have the bonus of the .srf file, which no doubt I will need 18 months down the line when the researcher wants to publish his results and needs to submit the data to SRA.
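
To make the qseq -> fastq step concrete, here is a minimal sketch of what such a conversion involves. It is not what GERALD, illumina2srf, or srf2fastq actually do internally. It assumes the usual 11-column tab-separated qseq layout (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) with "." marking no-calls; the function name and the read-name layout are hypothetical, chosen only for illustration. The quality string is passed through unchanged, so it stays in whatever encoding the qseq file used.

```python
def qseq_line_to_fastq(line, keep_failed=False):
    """Turn one qseq line into a 4-line FASTQ record.

    Returns None when the read failed the chastity filter and
    keep_failed is False. The quality string is copied as-is.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read_no, seq, qual, passed = fields

    # The last column is 1 for reads that passed filtering, 0 otherwise.
    if passed != "1" and not keep_failed:
        return None

    # qseq writes no-calls as "."; FASTQ conventionally uses "N".
    seq = seq.replace(".", "N")

    # Illustrative read name only (assumed layout, not a tool's exact output).
    name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read_no}"
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Applied line by line to a qseq file, this writes one FASTQ record per passing read; a real pipeline would also stream through gzip and handle paired reads.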

          Comment


          • #20
            Originally posted by skruglyak View Post
            You are exactly right. The sample sheet will be part of the input to the converter. We are working on the user guide and it will certainly have all of the details on this.

            Thanks!
            Semyon
            Semyon,

            Within the sample folder, the name of each fastq file provides the sample, index, lane and read information. What about the last three digits (001, 002..)? Do they represent the repeated analysis of the same data?

            Comment


            • #21
              Originally posted by selen View Post
              Semyon,

              Within the sample folder, the name of each fastq file provides the sample, index, lane and read information. What about the last three digits (001, 002..)? Do they represent the repeated analysis of the same data?

              No, the last digits represent the splitting up of a large file into smaller files. Certain cluster environments will be unhappy if individual files go beyond a certain size. There will be a configurable parameter to control the maximum number of entries in any one fastq file.

              Thanks,
              Semyon
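
The splitting described above can be sketched roughly as follows. This is not the converter's code, only an illustration of cutting a FASTQ stream into numbered pieces with a cap on the number of entries per piece. The function name `split_fastq` and the parameter name `max_entries` are hypothetical, and the sketch assumes every record occupies exactly four lines.

```python
from itertools import islice


def split_fastq(lines, max_entries):
    """Yield (suffix, chunk_lines) pairs for a stream of FASTQ lines.

    Each chunk holds at most max_entries records (4 lines per record)
    and gets a zero-padded three-digit suffix: 001, 002, ...
    """
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    it = iter(lines)
    number = 1
    while True:
        # Take up to max_entries complete records from the stream.
        chunk = list(islice(it, 4 * max_entries))
        if not chunk:
            return
        if len(chunk) % 4 != 0:
            raise ValueError("truncated FASTQ record")
        yield f"{number:03d}", chunk
        number += 1
```

Each yielded suffix would then be appended to a file name that already carries the sample, index, lane and read information.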

              Comment


              • #22
                Originally posted by Auction View Post
                Semyon

                It's great to hear the news, and I'm very concerned about the speed of the bcl converter in CASAVA 1.8. How many hours do we need to get compressed FASTQ for a typical HiSeq 2000 run (with and without multiplexing)? And how about its parallelization support? Thanks.

                Ying

                Hi Ying,

                there will be parallelization support. Exact timings are difficult to provide because there is a dependence on compute environment. It is also important to know whether you are CPU or I/O bound. We even see large variation depending on cluster utilization. As a very rough approximation, I would estimate 20 CPU hours for a standard HiSeq run, so maybe 2.5 hours if you use 8 CPUs.

                thanks,
                Semyon
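
The arithmetic behind that estimate, written out as a small sketch: it assumes the job is CPU bound and scales perfectly across processes, which is the best case. An I/O-bound run or a busy cluster, as noted above, would take longer. The function name is hypothetical.

```python
def estimated_wall_hours(cpu_hours, n_cpus):
    """Best-case wall-clock time, assuming ideal linear scaling."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return cpu_hours / n_cpus


# The rough figures from the post: 20 CPU hours spread over 8 CPUs.
print(estimated_wall_hours(20, 8))  # 2.5
```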

                Comment


                • #23
                  Originally posted by skruglyak View Post
                  Hi Ying,

                  there will be parallelization support. Exact timings are difficult to provide because there is a dependence on compute environment. It is also important to know whether you are CPU or I/O bound. We even see large variation depending on cluster utilization. As a very rough approximation, I would estimate 20 CPU hours for a standard HiSeq run, so maybe 2.5 hours if you use 8 CPUs.

                  thanks,
                  Semyon
                  Semyon

                  Thank you for the information. And how much additional time (as a percentage) will demultiplexing introduce in the new version? In addition, in CASAVA 1.7 "BAM output from RNA-Seq builds does not contain information on how alignments span exons. Such reads are represented by a separate BAM record for each partial exon alignment." This makes it difficult to visualize such reads in IGV. Will CASAVA 1.8 support a CIGAR string like "27M140N3M" as in Tophat, so that the user can easily detect the reads that span exons? Thanks.

                  Ying

                  Comment


                  • #24
                    Originally posted by kmcarr View Post
                    I have a concern about #2. Currently the illumina2srf tool uses the qseq files as input to generate the .srf files which are required for submission of NGS sequencing data to the NCBI or EBI SRAs. Will it still be possible to generate qseqs or would it be possible for the CASAVA team to work with the developers of the sequenceread toolkit to allow it to work directly from the .bcl files?

                    We will be working with the archives to make sure that the submission process is smooth. I will post more details as they become available. The current thinking is that BAM files will become the submission format and there will no longer be a need to go from qseq to srf. Submission directly from .bcl would not work for a variety of reasons.

                    Thanks,
                    Semyon

                    Comment


                    • #25
                      Originally posted by Auction View Post
                      Semyon

                      Thank you for the information. And how much additional time (as a percentage) will demultiplexing introduce in the new version? In addition, in CASAVA 1.7 "BAM output from RNA-Seq builds does not contain information on how alignments span exons. Such reads are represented by a separate BAM record for each partial exon alignment." This makes it difficult to visualize such reads in IGV. Will CASAVA 1.8 support a CIGAR string like "27M140N3M" as in Tophat, so that the user can easily detect the reads that span exons? Thanks.

                      Ying
                      Hi Ying,
                      I believe that demultiplexing time will be "in the noise" but I will post once we have some better numbers. Regarding RNA-Seq, the new version should meet your needs. BAM files produced in CASAVA 1.8 RNA-Seq builds use the CIGAR skip character ("N") to represent intron spanning reads as in Tophat's SAM output. These files can be visualized in the Broad IGV without modification.

                      Thanks,
                      Semyon
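
To illustrate how the "N" operation encodes an intron-spanning read, here is a minimal sketch that turns a CIGAR string such as the "27M140N3M" example into the reference intervals the read actually covers. The function name and the 0-based, half-open coordinate convention are choices made for this example; it is not code from CASAVA, Tophat, or IGV.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Return 0-based half-open reference intervals covered by a read.

    pos is the 0-based leftmost reference position of the alignment.
    An N operation (skipped region, e.g. an intron) ends the current
    block and starts a new one after the skip.
    """
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    blocks = []
    block_start = pos
    cursor = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the exon block, then jump over the skipped region.
            if cursor > block_start:
                blocks.append((block_start, cursor))
            cursor += length
            block_start = cursor
        elif op in "MD=X":
            # These operations consume reference bases within a block.
            cursor += length
        # I, S, H and P do not advance along the reference.
    if cursor > block_start:
        blocks.append((block_start, cursor))
    return blocks


print(aligned_blocks(100, "27M140N3M"))  # [(100, 127), (267, 270)]
```

With a single record per read, a viewer can draw the two exon blocks joined across the 140-base skip instead of showing two unrelated partial alignments.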

                      Comment


                      • #26
                        Originally posted by skruglyak View Post
                        [...]
                        2. The converter will produce compressed FASTQ files rather than qseq files.
                        [...]
                        Does this mean that I can finally use CASAVA 1.8 with just my fastq files, without having the qseqs and some other files? People sometimes have no access to the original qseq files and, as a consequence, are not able to use CASAVA (correct me if I am wrong).

                        cheers,
                        Sven

                        Comment


                        • #27
                          Originally posted by sklages View Post
                          Does this mean that I can finally use CASAVA 1.8 with just my fastq files, without having the qseqs and some other files? People sometimes have no access to the original qseq files and, as a consequence, are not able to use CASAVA (correct me if I am wrong).

                          cheers,
                          Sven
                          Hi Sven,

                          You certainly will not need qseq files. If you have nothing but fastq, I guess you could use ELAND in stand-alone mode, but you would be missing the statistics. To really use CASAVA 1.8, you would also need the fastq files to be in a simple directory structure described in the document and you would need some config files. Of course, if you just start with our new bcl converter (to be distributed with 1.8), the directory structure, the fastq, and the config files will all be generated.

                          Thanks,
                          Semyon

                          Comment


                          • #28
                            Originally posted by skruglyak View Post
                            Hi Sven,

                            You certainly will not need qseq files. If you have nothing but fastq, I guess you could use ELAND in stand-alone mode, but you would be missing the statistics. To really use CASAVA 1.8, you would also need the fastq files to be in a simple directory structure described in the document and you would need some config files. Of course, if you just start with our new bcl converter (to be distributed with 1.8), the directory structure, the fastq, and the config files will all be generated.

                            Thanks,
                            Semyon
                            Hi Semyon,

                            for the new GAII/HiSeq2000 runs I will surely use the software as intended, but as usual, other people read about CASAVA's capabilities/performance and want their datasets to be mapped and, very importantly, to be variant-called again. If the datasets are "old", we don't keep them online anymore; what is left in this case are the user's FASTQ files. That's why I am asking. The fastq files themselves should be enough for mapping and SNP calling!?

                            cheers,
                            Sven

                            Comment


                            • #29
                              Originally posted by sklages View Post
                              Hi Semyon,

                              for the new GAII/HiSeq2000 runs I will surely use the software as intended, but as usual, other people read about CASAVA's capabilities/performance and want their datasets to be mapped and, very importantly, to be variant-called again. If the datasets are "old", we don't keep them online anymore; what is left in this case are the user's FASTQ files. That's why I am asking. The fastq files themselves should be enough for mapping and SNP calling!?

                              cheers,
                              Sven
                              If CASAVA 1.8 will take plain FASTQ for input for mapping and SNP calling (to be confirmed), I would expect you'd have to convert your old Solexa/Illumina 1.3+ FASTQ files into Sanger FASTQ files first.
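
For the Illumina 1.3+ case, that conversion is just a change of ASCII offset from 64 to 33, as in the minimal sketch below. The function name is hypothetical. Note the assumption: this covers only Illumina 1.3+ files, where the characters encode Phred scores with offset 64. Older Solexa files use a different quality scale and need a proper score conversion, which this sketch does not attempt.

```python
def illumina13_to_sanger(qual):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for char in qual:
        score = ord(char) - 64  # decode with the Illumina 1.3+ offset
        if score < 0:
            raise ValueError(
                f"character {char!r} is below offset 64; "
                "input may be Solexa or already Sanger encoded"
            )
        converted.append(chr(score + 33))  # encode with the Sanger offset
    return "".join(converted)


print(illumina13_to_sanger("hhB@"))  # II#!
```

Only the fourth line of each FASTQ record needs this treatment; the header and sequence lines stay as they are.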

                              Comment


                              • #30
                                Originally posted by maubp View Post
                                If CASAVA 1.8 will take plain FASTQ for input for mapping and SNP calling (to be confirmed), I would expect you'd have to convert your old Solexa/Illumina 1.3+ FASTQ files into Sanger FASTQ files first.
                                Well, that's ok, just (another) conversion. If I'd need some run specific files ... this would make things more complicated (again) ..

                                Comment
