  • CASAVA 1.8.1: Replacement for ANALYSIS sequence{_pair}?

    I've been trying to convert from CASAVA 1.7 to CASAVA 1.8.1 and have had trouble configuring my sequence alignment runs. There appears to be no equivalent to the configuration file command "ANALYSIS sequence", which seems no longer to be supported. Am I missing something, or is there no way to specify that a lane should only be sequenced, but not mapped?

    -- Keith Bettinger, Bioinformaticist, Stanford Center for Genomics and Personalized Medicine

  • #2
    I'm looking into this as well, and it seems that there are a few annoyances with the new version of CASAVA.

    I suspect the reason for removing the ANALYSIS sequence option is that it's no longer supposed to be needed: the default output format from base calling is now fastq, so you'd simply not run alignment for that lane. However, this leads to some headaches:
    1. On some (possibly all?) platforms the output for a single lane is split into multiple fastq files, which means you've got to merge these back together for most downstream analysis outside of CASAVA.
    2. All sequences are present in the fastq files, even ones which have failed QC. There is a flag in the header which says which sequences should be filtered, but no downstream applications understand this, so you'll have to manually remove these entries before doing any further analysis.
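For point 2, here's a minimal sketch of the kind of shell filtering step this implies, assuming the CASAVA 1.8 header layout "@instrument:run:flowcell:lane:tile:x:y 1:N:0:index" (flag Y = filtered out, N = passed); the file names here are just placeholders built for demonstration:

```shell
# Build a tiny two-read example file (one passing read, one filtered).
printf '@M0:1:FC:1:1:1:1 1:N:0:ACGT\nACGT\n+\nIIII\n'  > demo.fastq
printf '@M0:1:FC:1:1:1:2 1:Y:0:ACGT\nTTTT\n+\nIIII\n' >> demo.fastq
gzip -f demo.fastq

# Flatten each 4-line FASTQ record onto one line, keep records whose
# header carries the :N: (not filtered) flag, then restore the layout.
zcat demo.fastq.gz \
  | paste - - - - \
  | awk -F'\t' '$1 ~ / [12]:N:/' \
  | tr '\t' '\n' \
  | gzip > demo.filtered.fastq.gz

zcat demo.filtered.fastq.gz | wc -l   # 4: only the :N: read survives
```

Because this streams, it does the decompress/filter/recompress in a single pass rather than needing an intermediate uncompressed file.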


    Whilst Illumina have tried to move towards more standardised file formats (using Sanger-offset fastq files and BAM for alignments), the new version of CASAVA is actually going to make our lives harder, since we'll need to introduce a manual step to merge and filter the fastq files to get back to the single, quality-filtered output file we had before.

    If I'm reading the manual wrong here (we haven't actually done a run under 1.8 yet) then I'd be very happy to hear it. How are other people coping with the changes in the new version?



    • #3
      Simon/Keith,

      It is true that the "sequence"-only option is now redundant, since the default output of pipeline v1.8 is fastq format files. Since Illumina has taken away the option of generating the "qseq" files (which were needed for generation of SRF files), they now leave *all* (filtered and unfiltered) data in the sequence files.

      This does mean that there is one additional step needed if you want only quality-filtered data. (BTW: the example code in the Illumina pipeline manual (p. 40) showing how to do the filtering is incorrect. Dawe posted the correct filtering code in a separate thread about v1.8 here: http://seqanswers.com/forums/showthr...highlight=dawe) I wish Illumina would offer the filtering option as a command line switch for the pipeline so we would not need to worry about an additional step.

      You are supposed to be able to control the number of file pieces generated for the sequences with the command line option "--fastq-cluster-count". Illumina recommends that you do not set this over 16 million. I have not played with this option much, but if you were not going to do alignments with ELAND then you can probably set this number higher and generate a single large file of all sequences (has anyone tried this?).
      Last edited by GenoMax; 09-07-2011, 05:35 AM. Reason: add_info



      • #4
        Originally posted by GenoMax View Post
        The example shown for "cat"ing the gzipped-fastq sequence file pieces together is also incorrect in the pipeline manual (page 39). At least on RHEL/CentOS this does not work. One needs to do something like: gzip -c zip1 zip2 zip3 > finalZip.
        That's odd - I've not tried it with Illumina files, but the gzip man page explicitly says that you should be able to concatenate gzipped files and then extract them all in one operation:

        Code:
        ADVANCED USAGE
               Multiple  compressed  files  can be concatenated. In this case, gunzip will extract
               all members at once. For example:
        
                     gzip -c file1  > foo.gz
                     gzip -c file2 >> foo.gz
        
               Then
        
                     gunzip -c foo
        
               is equivalent to
        
                     cat file1 file2



        • #5
          Originally posted by simonandrews View Post
          That's odd - I've not tried it with Illumina files, but the gzip man page explicitly says that you should be able to concatenate gzipped files and then extract them all in one operation: [...]
          "gzip" is needed to concatenate the "gzip" files together (as opposed to the "cat" example shown in the pipeline manual below)
          Code:
          cat gzip1 gzip2 gzip3 > final_gzip



          • #6
            Originally posted by GenoMax View Post
            "gzip" is needed to concatenate the "gzip" files together
            Really? I can't see how:

            gzip -c file1 > out
            gzip -c file2 >> out

            Is any different from:

            gzip -c file1 > out1
            gzip -c file2 > out2

            cat out1 out2 > out

            Which would be what you're doing in the Illumina case. gzip -c just writes to stdout, and out is being appended to, so they should all be equivalent. Testing this on our system suggests that concatenating gzipped files works OK:

            Code:
            $ gzip seq1.txt
            $ gzip seq2.txt
            $ cat seq1.txt.gz seq2.txt.gz > seq12.txt.gz
            $ zcat seq1.txt.gz seq2.txt.gz > seq12_separate.txt
            $ zcat seq12.txt.gz > seq12_together.txt
            $ diff seq12_separate.txt seq12_together.txt
            (no output)
            It's possible that this hasn't always been supported (but RHEL5's gzip is OK), or that some gzip compressors may write files which can't be concatenated in this way, but it looks like it should work for normal gzipped data.



            • #7
              Simon is indeed right!

              I stand corrected. I should have tested this again before posting (will edit the offending original line out).

              The last time I tried to "cat" about 15 Illumina gzip files together, I got an error about the resulting file not being in the right format. I then switched to the gzip -c option, which worked. I should have looked at this more carefully.



              Originally posted by simonandrews View Post
              Really? I can't see how: [...]



              • #8
                Originally posted by simonandrews View Post
                [...] it looks like it should work for normal gzipped data.
                Simon,

                It is true that simply cat'ing the gzipped files together produces a file which decompresses to the equivalent of cat'ing the two uncompressed files. However, not all programs appear to handle cat'ed gzipped files properly, including your own FastQC. If you run FastQC on a file like this it only examines the chunk from the first file in the merged output. Interestingly, it correctly counts the number of reads for the whole file but doesn't process them all. Here is an example of running fastqc on such a file:

                Code:
                cat file1.fastq.gz file2.fastq.gz > file12.fastq.gz
                
                fastqc file12.fastq.gz
                Started analysis of file12.fastq.gz
                Approx 5% complete for file12.fastq.gz
                Approx 10% complete for file12.fastq.gz
                Approx 15% complete for file12.fastq.gz
                Approx 20% complete for file12.fastq.gz
                Approx 25% complete for file12.fastq.gz
                Approx 30% complete for file12.fastq.gz
                Approx 35% complete for file12.fastq.gz
                Approx 40% complete for file12.fastq.gz
                Approx 45% complete for file12.fastq.gz
                Approx 50% complete for file12.fastq.gz
                Analysis complete for file12.fastq.gz
                The 'Total Sequences' reported in the output is equal to the number of reads in file1.fastq.gz alone.

                I'm guessing that the Java library for reading compressed files treats the end-of-stream marker of the first gzip member as the end of the whole file when the pieces are simply cat'ed together, and therefore prematurely terminates reading.



                • #9
                  Originally posted by kmcarr View Post
                  Simon,

                  The 'Total Sequences' reported in the output is equivalent to the number of reads in file1.fastq.gz.

                  I'm guessing that the java library for reading compressed files is encountering some residual EOF information retained when the files are simply cat'ed together and therefore prematurely terminates reading from the file.
                  I should hassle the FastQC author about that.

                  Actually I'm just writing a new mode for fastqc which will handle both the concatenation and filtering of CASAVA-generated fastq files so we don't need to put in workarounds for this.

                  I guess what it comes down to in the end is that concatenation of gzipped files requires some support from the decompressor.
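A quick shell check of that point: cat'ing independently gzipped files produces a valid multi-member stream that command-line gzip/zcat read end to end, whereas a decompressor that stops at the first member's end marker (as FastQC's reader appeared to) would only ever see the first part:

```shell
# Two independently compressed members appended into one file.
printf 'aaa\n' | gzip -c  > multi.gz
printf 'bbb\n' | gzip -c >> multi.gz

# zcat reads every member, so both lines come back.
zcat multi.gz   # prints aaa then bbb
```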



                  • #10
                    Originally posted by simonandrews View Post
                    I should hassle the FastQC author about that.
                    Well, I hear he's a nice guy.

                    Originally posted by simonandrews View Post
                    Actually I'm just writing a new mode for fastqc which will handle both the concatenation and filtering of CASAVA-generated fastq files so we don't need to put in workarounds for this.
                    That would be handy. I get around the need to concatenate files by upping the --fastq-cluster-count parameter to a number large enough to ensure a single output file for each sample. But getting just the filtered reads is a pain. It's not that it's difficult, just time consuming due to the need to decompress, filter, recompress.

                    And one more gripe about CASAVA 1.8.x while I'm at it: now the good reads are the ones flagged with an "N", as in "No, they were not filtered".



                    • #11
                      I am curious: what number are you using for --fastq-cluster-count?


                      Originally posted by kmcarr View Post

                      I get around the need to concatenate files by upping the --fastq-cluster-count parameter to a number large enough to ensure a single output file for each sample.



                      • #12
                        Originally posted by GenoMax View Post
                        I am curious as to what number are you using for fastq-cluster-count?
                        For our GAIIx I use "--fastq-cluster-count 100000000" (that's 100,000,000, or one hundred million). The exact number is unimportant; just pick a number which is much greater than the number of reads in any single sample (if demultiplexing) or lane. Obviously for a HiSeq this number would need to be even larger. I'm always worried that I will accidentally type 10000000 (10,000,000), so I make sure to double check before hitting enter for configureBclToFastq. I would love it if Illumina had some way of setting the read count per file to unlimited, e.g. "--fastq-cluster-count 0" (are you listening skruglyak?).
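For anyone wanting to try the same thing, a hypothetical invocation along these lines (the directory paths are placeholders, and the exact option set should be checked against the CASAVA manual for your version):

```
configureBclToFastq.pl \
    --input-dir  <RunFolder>/Data/Intensities/BaseCalls \
    --output-dir Unaligned \
    --fastq-cluster-count 100000000
cd Unaligned && make    # CASAVA writes a Makefile; running make does the conversion
```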



                        • #13
                          Glad to know that works.

                          We see 200-250 million (or more) reads with the HiSeq, so I will need to try a correspondingly large number.

                          Originally posted by kmcarr View Post
                          For our GAIIx I use "--fastq-cluster-count 100000000" [...]



                          • #14
                            I've just put up a development snapshot of FastQC which introduces a --casava option which will merge together fastq files which come from the same sample group and present only a single output file. It will also filter out any reads which are flagged as filtered so they won't affect the results.

                            You can enable this mode by adding --casava to the command line, or by selecting 'Casava FastQ Files' from the set of file filters in the interactive application.

                            As I don't actually have a run folder from a CASAVA 1.8 run yet, I'd appreciate it if someone who already has some of this data could test this and let me know if it seems to work OK on real data. Once I'm happy that I'm not breaking anything with these changes I'll put out an official release.

                            The new version is up on our website here (but isn't linked from the project page). It's just the linux/windows package for now, but I can make a Mac application bundle too if anyone needs that.

                            Thanks



                            • #15
                              Simon,

                              Tried the new version out with a sample that was processed with pipeline v.1.8. This was a multiplex sample that was split across 47 gzip files (I did not adjust the --fastq-cluster-count parameter for this run).

                              FastQC analysis finished without a problem. I do not remember how many raw reads there were in the 47 files, but FastQC is reporting ~175 million total sequences (which I assume represents the quality-filtered reads).

                              Could you add a field to "Basic Statistics" reporting the total number of sequences that went into FastQC, so it would be easy to see what percentage got filtered out?

                              Thanks!

                              Originally posted by simonandrews View Post
                              I've just put up a development snapshot of FastQC which introduces a --casava option [...]

