  • Miseq: how to access index base calls? bcl2fastq does not work

    Hello,

    I get data from the local sequencing facility, where they use Nextera and dual indexing on a MiSeq machine.

    I have some issues with multiplexed samples, where sometimes a lot of sequence goes to the unassigned pool. Most often 1 sample is completely missing and the amount of sequence in the unassigned pool roughly corresponds to what I expect.

    I looked in the appropriate report files (namely DemultiplexSummaryF1L1.txt), and the expected indexes (both 1 and 2) account for the vast majority of the reads (though I'm kind of surprised by the enormous number of variants in the indexes - in just 8 bases!).

    Thus I don't think it is an incorrect index in the configuration file, otherwise other samples would also be affected. Since two times in a row it was the same combination of indexes that failed, I thought that might be the problem, but in subsequent runs the combination changed... so I'm left trying to figure out what is going on.

    I wanted to go back and try the basecalling myself, but so far I couldn't see a way of doing it easily for MiSeq data other than with MiSeq Reporter...

    I tried bcl2fastq and after quite some time fighting to get it compiled, I couldn't make it work with MiSeq data: file name extensions and versions are apparently not adequate for it.

    Any ideas welcome.

    Thanks,
    Daniel

  • #2
    I am surprised that the facility that did your sequencing is not willing to work with you on diagnosing what went wrong.

    My guess is basecalling may not be the problem here but sample overclustering may be. As I posted in the other thread do you know the cluster concentration for this run ( clusters/mm^2)?

    Did you extract all "tag sequences" from the "Undetermined" pool file or just browsed through some?

    If there are 2 (or more) N's in one or both tags then you may need to give up on this run, since you are not going to be able to de-multiplex the sequences (you can only allow for a max of 2 errors per barcode, and that may not work all the time since it depends on the combination of tags you have used).
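To make the "depends on the combination of tags" point concrete, here is a minimal Python sketch (not part of any Illumina tool). It computes the smallest pairwise Hamming distance in a barcode set; with a minimum distance d, at most (d - 1) // 2 mismatches can be tolerated without a read becoming ambiguous between two barcodes. The function names are made up for illustration, and the two barcodes are the i7 indexes from the sample sheet example later in this thread.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def max_safe_mismatches(barcodes: list[str]) -> int:
    """Largest mismatch count that still maps every read to a single barcode."""
    min_dist = min(hamming(a, b) for a, b in combinations(barcodes, 2))
    return (min_dist - 1) // 2


# i7 indexes taken from the sample sheet example later in this thread.
i7 = ["AAGAGGCA", "CGAGGCTG"]
print(max_safe_mismatches(i7))
```

If two tags in the pool differ at only one or two positions, the function returns 0, which means no mismatches can be allowed safely for that set.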



    • #3
      Originally posted by GenoMax View Post
      I am surprised that the facility that did your sequencing is not willing to work with you on diagnosing what went wrong.
      I'm working with them to try to figure out what might have gone wrong.

      Originally posted by GenoMax View Post
      My guess is basecalling may not be the problem here but sample overclustering may be. As I posted in the other thread do you know the cluster concentration for this run ( clusters/mm^2)?
      I don't know the exact numbers but I think they had a bit of a high value in a couple of runs. Overclustering affects read quality (adding more reads to the unassigned pool), but what I'm observing is that even when the sequence quality is good and cluster density is ok, one specific sample (not always the same) disappears (apparently into the unassigned pool). Both the indexes that are part of that sample are read (meaning in counts reported in the DemultiplexSummaryF1L1.txt file).

      Originally posted by GenoMax View Post
      Did you extract all "tag sequences" from the "Undetermined" pool file or just browsed through some?

      If there are 2 (or more) N's in one or both tags then you may need to give up on this run, since you are not going to be able to de-multiplex the sequences (you can only allow for a max of 2 errors per barcode, and that may not work all the time since it depends on the combination of tags you have used).
      What do you mean by the "tag sequences" from the undetermined pool?
      What I get is 251x251 reads (without the indexes). That's why I was trying to get bcl2fastq working, to see if I could access the index sequences and find out what's going on.

      Thanks for the help,
      Daniel
      Last edited by dsobral; 12-11-2013, 12:34 PM.



      • #4
        Originally posted by dsobral View Post
        I don't know the exact numbers but I think they had a bit of a high value in a couple of runs. Overclustering affects read quality (adding more reads to the unassigned pool), but what I'm observing is that even when the sequence quality is good and cluster density is ok, one specific sample (not always the same) disappears (apparently into the unassigned pool). Both the indexes that are part of that sample are read (meaning in counts reported in the DemultiplexSummaryF1L1.txt file).
        Let us wait to get that number. If you are running v.3 kits then the number can go as high as (1300-1400 clusters/mm^2). But if it is higher than that then overclustering is your problem. Normal reads are tolerant to overclustering but the tag reads suffer badly when samples go over a certain cluster #.

        Originally posted by dsobral View Post

        What do you mean by the "tag sequences" from the undetermined pool?
        What I get is 251x251 reads (without the indexes). That's why I was trying to get bcl2fastq working, to see if I could access the index sequences and find out what's going on.

        Thanks for the help,
        Daniel
        If you (or the facility) ran the demultiplexing with the 2D barcodes then the reads that end up in the "Undetermined" file will contain the tags in the sequence ID line (the two tags will be concatenated together). See example below. In case of bad calls there will be N's in the tags.
        @HWI-MXXXX5:34:000000000-AXXXX:1:1101:15353:1403 1:N:0:TAAGGAGTAGATCG
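For anyone who wants to extract all the tags rather than browse a few, a rough Python sketch follows. It assumes plain 4-line FASTQ records (no wrapped sequences) and headers shaped like the example above, with the tag as the last colon-separated field. The function name is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def tally_index_tags(fastq_lines: Iterable[str]) -> Counter:
    """Count index tags found at the end of each FASTQ header line."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # the header is the first line of each 4-line record
            continue
        header = line.strip()
        if not header.startswith("@") or " " not in header:
            continue
        # The comment part looks like "1:N:0:TAAGGAGTAGATCG"; the tag is the last field.
        tag = header.split(" ", 1)[1].rsplit(":", 1)[-1]
        counts[tag] += 1
    return counts


example = [
    "@HWI-MXXXX5:34:000000000-AXXXX:1:1101:15353:1403 1:N:0:TAAGGAGTAGATCG",
    "ACGT",
    "+",
    "IIII",
]
for tag, n in tally_index_tags(example).most_common(10):
    print(tag, n, "contains N" if "N" in tag else "")
```

On a real file you would pass an open file handle instead of the list (for a gzipped file, `gzip.open(path, "rt")`). The most common tags in the Undetermined file should show quickly whether the missing sample is sitting there under a slightly different tag.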
        Last edited by GenoMax; 12-11-2013, 12:59 PM.



        • #5
          Originally posted by GenoMax View Post
          If you (or the facility) ran the demultiplexing with the 2D barcodes then the reads that end up in the "Undetermined" file will contain the tags in the sequence ID line (the two tags will be concatenated together). See example below. In case of bad calls there will be N's in the tags.
          That might be a Casava/BCL2Fastq only thing. None of the reads we've generated using MSR have that tag at the end.



          • #6
            Originally posted by dsobral View Post
            Hello,

            I get data from the local sequencing facility, where they use Nextera and dual indexing on a MiSeq machine.



            I tried bcl2fastq and after quite some time fighting to get it compiled, I couldn't make it work with MiSeq data: file name extensions and versions are apparently not adequate for it.
            Daniel,

            Bcl2Fastq v1.8.4 works perfectly well with MiSeq data, I use it all the time. What I am wondering is if you really have all that is needed to run it though. Bcl2Fastq requires the full run folder produced by the MiSeq (or HiSeq). It would be highly unusual for a sequencing facility to provide the full run folder to clients (we never do). It is simply too large and most of the data is not useful to the submitter; all they want back are the sequence files. Unless you got the full run folder from your sequencing facility there is nothing Bcl2Fastq can do for you.

            Here is a link to a description of the MiSeq Run Folder.



            • #7
              Originally posted by kcchan View Post
              That might be a Casava/BCL2Fastq only thing. None of the reads we've generated using MSR have that tag at the end.
              My reads were also generated using MSR and they don't have that tag at the end. (would be really nice though!)



              • #8
                Originally posted by kmcarr View Post
                Daniel,

                Bcl2Fastq v1.8.4 works perfectly well with MiSeq data, I use it all the time. What I am wondering is if you really have all that is needed to run it though. Bcl2Fastq requires the full run folder produced by the MiSeq (or HiSeq). It would be highly unusual for a sequencing facility to provide the full run folder to clients (we never do). It is simply too large and most of the data is not useful to the submitter; all they want back are the sequence files. Unless you got the full run folder from your sequencing facility there is nothing Bcl2Fastq can do for you.

                Here is a link to a description of the MiSeq Run Folder.
                They usually only provide the fastq, but given this issue, I asked them for the full folder, so I should have all the files.

                Right now the problem I'm having when running is that it does not seem to recognize my Sample sheet file:

                configureBclToFastq.pl --input-dir Data/Intensities/BaseCalls --output-dir Unaligned --positions-format .locs --no-eamss
                "ERROR: Wrong number of fields in sample sheet (expected: 10, got 2: IEMFileVersion,4)"
                Last edited by dsobral; 12-11-2013, 02:07 PM.



                • #9
                  Here is an example template samplesheet file for use with MiSeq Reporter. Make sure you save it as a "csv" (comma-separated values) file.

                  EDIT: I am leaving this here as an example. If you are trying to manually run Bcl2fastq then you will need to use a different samplesheet. The template for that samplesheet is posted in #14 below.

                  Code:
                  [Header]
                  IEMFileVersion,4
                  Investigator Name, REPLACE
                  Experiment Name, EXPT
                  Date,12/6/2013
                  Workflow,GenerateFASTQ
                  Application,FASTQ Only
                  Assay,Nextera
                  Description,
                  Chemistry,
                  
                  [Reads]
                  250
                  250
                  
                  [Settings]
                  ReverseComplement,0
                  Adapter,CTGTCTCTTATACACATCT
                  
                  [Data]
                  Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
                  SampleA,,,,N711,AAGAGGCA,N507,AAGGAGTA,PROJECT_NAME,
                  SampleB,,,,N710,CGAGGCTG,N507,AAGGAGTA,PROJECT_NAME,
                  Last edited by GenoMax; 12-12-2013, 04:35 AM. Reason: Clarification of two samplesheet formats



                  • #10
                    Originally posted by dsobral View Post
                    They usually only provide the fastq, but given this issue, I asked them the full folder, so I should have all the files.

                    Right now the problem I'm having when running is that it does not seem to recognize my Sample sheet file:

                    configureBclToFastq.pl --input-dir Data/Intensities/BaseCalls --output-dir Unaligned --positions-format .locs --no-eamss
                    "ERROR: Wrong number of fields in sample sheet (expected: 10, got 2: IEMFileVersion,4)"
                    Just to clarify what's happening, you're trying to run bcl2fastq using a MiSeq sample sheet. However, the sample sheet required is in a different format. The format GenoMax refers to is the older style used by CASAVA and bcl2fastq. You'll have to make your own sample sheet or install Illumina Experiment Manager 1.4 in order to set up the proper sample sheet.
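A quick way to tell the two formats apart before launching bcl2fastq is sketched below in Python. It is only a heuristic based on what this thread shows: MiSeq sheets start with a bracketed section such as `[Header]`, while the error message above says the CASAVA parser expects 10 fields per line. The function name and the returned labels are made up for illustration.

```python
import csv
import io

CASAVA_COLUMNS = 10  # field count the bcl2fastq error message in this thread expects


def sample_sheet_style(text: str) -> str:
    """Guess whether a sample sheet is MiSeq/IEM style or CASAVA style."""
    rows = [r for r in csv.reader(io.StringIO(text)) if any(c.strip() for c in r)]
    if not rows:
        return "empty"
    if rows[0][0].strip().startswith("["):
        return "miseq"  # sectioned file: [Header], [Reads], [Settings], [Data]
    if all(len(r) == CASAVA_COLUMNS for r in rows):
        return "casava"
    return "unknown"


print(sample_sheet_style("[Header]\nIEMFileVersion,4\n"))
```

A sheet that comes back as "miseq" will trigger the "Wrong number of fields" error shown above if it is handed to configureBclToFastq.pl unchanged.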



                    • #11
                      Originally posted by kcchan View Post
                      Just to clarify what's happening, you're trying to run bcl2fastq using a MiSeq sample sheet. However, the sample sheet required is in a different format. The format GenoMax refers to is the older style used by CASAVA and bcl2fastq. You'll have to make your own sample sheet or install Illumina Experiment Manager 1.4 in order to set up the proper sample sheet.
                      Thanks for the reminder. We never use MiSeq's on-board data analysis so I forgot about the file format.

                      What's odd is there should be a samplesheet in the folder dsobral got if the full data folder is there.

                      I have amended my example above to reflect the new format.

                      dsobral: If you want to make a SampleSheet up using a GUI then you can use the "Illumina Experiment Manager" (v. 1.6.0) that you can download here: http://support.illumina.com/sequenci...downloads.ilmn

                      By using the older version of Illumina Expt Manager (v.1.4.x) one can make up CASAVA/Bcl2Fastq style samplesheets.
                      Last edited by GenoMax; 12-12-2013, 02:35 AM. Reason: Added info about other version of ILMN EXPT MANAGER



                      • #12
                        Originally posted by GenoMax View Post
                        Thanks for the reminder. We never use MiSeq's on-board data analysis so I forgot about the file format.

                        What's odd is there should be a samplesheet in the folder dsobral got if the full data folder is there.

                        I have amended my example above to reflect the new format.

                        dsobral: If you want to make a SampleSheet up using a GUI then you can use the "Illumina Experiment Manager" (v. 1.6.0) that you can download here: http://support.illumina.com/sequenci...downloads.ilmn
                        What you had initially would have worked fine for dsobral's case (running Bcl2Fastq on a MiSeq run folder with a Casava style sample sheet). If you want to use the MiSeq sample sheet you'll need to run MSR locally, which is way more hassle than it's worth.



                        • #13
                          Thanks for the help.

                          Indeed I have a SampleSheet in the input directory, which looks exactly like the one GenoMax showed. But this format does not seem to work with bcl2fastq.

                          I could see that the problem comes from this class:
                          Casava/Demultiplex/SampleSheet/Csv.pm

                          From what I see this parser assumes the SampleSheet to be composed simply of 10 column lines (with a header), with the following composition:
                          "FCID","Lane","SampleID","SampleRef","Index","Description","Control","Recipe","Operator","SampleProject"

                          This does not seem at all compatible with the information in the SampleSheet from the MiSeq (e.g. dual indexing etc...). In any case I tried to create a sample sheet that looked compatible with this format:
                          FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
                          000000000-A501F,1,BOGUS_NAME,BOGUS_REF,AAAAAAAAAAAAAA,A01,,,,BOGUS

                          Notice that for it to work, FCID needs to be a specific ID (you can get it from the run folder name) and the Index needs to be 14bp (which I already found suspicious).

                          Now it seemed to work! It creates Fastq with undetermined indexes and it has the index in the end, as GenoMax showed.
                          BUT (it seemed too good to be true)... the index is 14bp, e.g.
                          @HWI-M01876:4:000000000-A501F:1:1101:15522:1333 1:N:0:TAAGGCGCTCTCTA

                          This does not seem to be right... We're using dual indexing 8bp i5 and i7...
                          Taking this example, TAAGGCG corresponds to the first 7 bases of one index and CTCTCTA are the 7 bases of the second... so only 1 base is missing from each index... I think I can actually recover the samples based solely on this, but would be nice to have the complete indexes...
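Recovering samples from the 7+7 tags could look roughly like the Python sketch below. It splits the concatenated tag in half and keeps the samples whose two indexes start with those halves. The sample table is hypothetical (the 8th base of each index is filled in only for illustration), and the sketch assumes the sample sheet lists both indexes in the same orientation as they appear in the read header. It does exact prefix matching only, with no mismatch tolerance.

```python
def split_tag(tag: str) -> tuple[str, str]:
    """Split a concatenated dual-index tag into its two equal halves."""
    if len(tag) % 2:
        raise ValueError("expected an even-length concatenated tag")
    half = len(tag) // 2
    return tag[:half], tag[half:]


def match_sample(tag: str, samples: dict[str, tuple[str, str]]) -> list[str]:
    """Return sample names whose index prefixes equal both halves of the tag."""
    first, second = split_tag(tag)
    return [
        name
        for name, (idx1, idx2) in samples.items()
        if idx1.startswith(first) and idx2.startswith(second)
    ]


# Hypothetical sample table; the 8th base of each index is made up for illustration.
samples = {"SampleA": ("TAAGGCGA", "CTCTCTAT")}
print(match_sample("TAAGGCGCTCTCTA", samples))
```

If two samples share the same first 7 bases in both indexes, the returned list has more than one name and the read cannot be assigned from the truncated tag alone.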

                          I'm going to try and see if I can recover the "missing sample" now.

                          Thanks again,
                          Daniel



                          • #14
                            Daniel

                            Good to see that you are hanging in there.

                            Here is the correct format for the CASAVA style samplesheet. I will leave the new format in on the post above for reference. You can omit "SampleProject" (at least it works with CASAVA) or add it as you found to be the last column in the samplesheet. That only works to segregate your sequences into "Projects" if you have more than one in the lane.

                            Example for single barcodes:
                            Code:
                            FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,Operator,SampleProject
                            000000000-XXXXX,1,SampleA,no_ref,TAAGGCG,NA,N,NA,NA,
                            000000000-XXXXX,1,SampleB,no_ref,CGTACTA,NA,N,NA,NA,
                            Example for 2D barcodes:
                            Code:
                            FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,Operator,SampleProject
                            000000000-XXXXX,1,SampleA,no_ref,TAAGGCG-TAGATCG,NA,N,NA,NA,
                            000000000-XXXXX,1,SampleB,no_ref,CGTACTA-TAGATCG,NA,N,NA,NA,
                            Replace "000000000-XXXXX" with your flowcell ID (In your case 000000000-A501F). Change Sample names as needed and adjust rows to suit. You will need to use (n-1) i.e. 7 bases from the 8 base tags (we can explain that later). Note the "-" separating the two tags. Save the file as "csv" format.

                            You will have to add "--sample-sheet /path_to_samplesheet_you_made" to your bcl2fastq command line.

                            The tags will look like "TAAGGCGTAGATCG" in the final sequence files ID's. NOTE: The tags get concatenated in the sequence file ("-" gets removed). That is the explanation for your 14 bp tags.
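If you would rather not retype the sheet by hand, the conversion can be sketched in Python as below. This is not an Illumina tool, just an illustration of the rules described above: take the rows under `[Data]` in the MiSeq sheet, keep the first n-1 bases of each tag, and join the two tags with "-". The header names follow what Daniel quoted from Csv.pm; the placeholder values ("no_ref", "NA", "N") are copied from the examples above, and the function name is made up.

```python
import csv
import io

CASAVA_HEADER = ["FCID", "Lane", "SampleID", "SampleRef", "Index", "Description",
                 "Control", "Recipe", "Operator", "SampleProject"]


def miseq_to_casava(miseq_text: str, fcid: str, lane: int = 1) -> str:
    """Build a CASAVA-style sheet from the [Data] section of a MiSeq sheet."""
    lines = miseq_text.splitlines()
    # Everything after the "[Data]" line is an ordinary CSV table with a header row.
    start = next(i for i, l in enumerate(lines) if l.strip().startswith("[Data]")) + 1
    reader = csv.DictReader(lines[start:])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(CASAVA_HEADER)
    for row in reader:
        # Keep n-1 bases of each tag, as described above, joined by "-".
        tags = [row[k][:-1] for k in ("index", "index2") if row.get(k)]
        writer.writerow([fcid, lane, row["Sample_ID"], "no_ref", "-".join(tags),
                         "NA", "N", "NA", "NA", row.get("Sample_Project", "")])
    return out.getvalue()
```

Write the returned text to a .csv file and pass that file with "--sample-sheet" as described above. A single-index sheet (no index2 column) simply yields a single 7-base tag with no hyphen.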
                            Last edited by GenoMax; 02-25-2015, 09:10 AM. Reason: Added samplesheet example for single barcodes



                            • #15
                              Ok, I now really read the sample sheet definition that is on the bcl2fastq user guide. Since I already had a sample sheet I assumed it was that one... didn't realize at first that it was something so different.

                              Originally posted by GenoMax View Post
                              Daniel
                              You will need to use (n-1) i.e. 7 bases from the 8 base tags (we can explain that later). Note the "-" separating the two tags. Save the file as "csv" format.

                              ...

                              The tags will look like "TAAGGCGTAGATCG" in the final sequence files ID's. NOTE: The tags get concatenated in the sequence file ("-" gets removed). That is the explanation for your 14 bp tags.
                              I also got the hyphen from the manual. But why is this n-1 in the indexes? Is it because of phasing calculations like in the read?
