**GenoMax** · 12-11-2013, 12:16 PM

I am surprised that the facility that did your sequencing is not willing to work with you on diagnosing what went wrong.

My guess is that basecalling may not be the problem here, but sample overclustering may be. As I posted in the other thread, do you know the cluster density for this run (clusters/mm^2)?

Did you extract all "tag sequences" from the "Undetermined" pool file or just browse through some?

If there are 2 (or more) N's in one or both tags then you may need to give up on this run, since you are not going to be able to de-multiplex the sequences (you can only allow for a max of 2 errors per barcode, and that may not work all the time since it depends on the combinations of tags you have used).
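
How many mismatches a set of tags can tolerate follows from how far apart the tags are from each other. A minimal Python sketch, assuming equal-length tags and using the two i7 indexes from the example sample sheet posted further down this thread; the helper names are illustrative and not part of any Illumina tool:

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(tags):
    """Smallest Hamming distance between any two tags in the set."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))


# i7 indexes copied from the example MiSeq sample sheet in this thread
i7_tags = ["AAGAGGCA", "CGAGGCTG"]
d = min_pairwise_distance(i7_tags)

# With minimum distance d, a read with up to (d - 1) // 2 mismatches
# is still closer to its true tag than to any other tag in the set.
max_safe_mismatches = (d - 1) // 2
print(d, max_safe_mismatches)
```

Running the same check on the full list of tag combinations used in a lane shows whether allowing any mismatches is safe for that particular pool.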

**dsobral** · 12-11-2013, 12:27 PM

Originally posted by GenoMax View Post

I am surprised that the facility that did your sequencing is not willing to work with you on diagnosing what went wrong.

I'm working with them to try to figure out what might have gone wrong.

Originally posted by GenoMax View Post

My guess is that basecalling may not be the problem here, but sample overclustering may be. As I posted in the other thread, do you know the cluster density for this run (clusters/mm^2)?

I don't know the exact numbers but I think they had a bit of a high value in a couple of runs. Overclustering affects read quality (adding more reads to the unassigned pool), but what I'm observing is that even when the sequence quality is good and cluster density is ok, one specific sample (not always the same) disappears (apparently into the unassigned pool). Both the indexes that are part of that sample are read (meaning in counts reported in the DemultiplexSummaryF1L1.txt file).

Originally posted by GenoMax View Post

Did you extract all "tag sequences" from the "Undetermined" pool file or just browse through some?

If there are 2 (or more) N's in one or both tags then you may need to give up on this run, since you are not going to be able to de-multiplex the sequences (you can only allow for a max of 2 errors per barcode, and that may not work all the time since it depends on the combinations of tags you have used).

What do you mean by the "tag sequences" from the undetermined pool?
What I get is 251x251 reads (without the indexes). That's why I was trying to get bcl2fastq working, to see if I could access the index sequences and find out what's going on.

Thanks for the help,
Daniel

**GenoMax** · 12-11-2013, 12:53 PM

Originally posted by dsobral View Post

I don't know the exact numbers but I think they had a bit of a high value in a couple of runs. Overclustering affects read quality (adding more reads to the unassigned pool), but what I'm observing is that even when the sequence quality is good and cluster density is ok, one specific sample (not always the same) disappears (apparently into the unassigned pool). Both the indexes that are part of that sample are read (meaning in counts reported in the DemultiplexSummaryF1L1.txt file).

Let us wait to get that number. If you are running v3 kits then the number can go as high as 1300-1400 clusters/mm^2. If it is higher than that, then overclustering is your problem. Normal reads are tolerant of overclustering, but the tag reads suffer badly when samples go over a certain cluster density.

Originally posted by dsobral View Post

What do you mean by the "tag sequences" from the undetermined pool?
What I get is 251x251 reads (without the indexes). That's why I was trying to get bcl2fastq working, to see if I could access the index sequences and find out what's going on.

Thanks for the help,
Daniel

If you (or the facility) ran the demultiplexing with the 2D barcodes then the reads that end up in the "Undetermined" file will contain the tags in the sequence ID line (the two tags will be concatenated together). See the example below. In case of bad calls there will be N's in the tags.

@HWI-MXXXX5:34:000000000-AXXXX:1:1101:15353:1403 1:N:0:TAAGGAGTAGATCG
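
To look at all tags in the "Undetermined" file rather than browsing a few, one option is to tally the last field of every ID line. A rough Python sketch, assuming standard 4-line FASTQ records with the tag after the last ":" as in the example above; the function names are made up for illustration:

```python
from collections import Counter


def index_from_header(header):
    """Return the text after the last ':' of a CASAVA 1.8-style ID line."""
    return header.rstrip().rsplit(":", 1)[-1]


def tally_indexes(lines):
    """Count tags over the ID lines of FASTQ records (every 4th line)."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:
            counts[index_from_header(line)] += 1
    return counts


# One record using the ID line quoted above; sequence and qualities are dummies.
example = [
    "@HWI-MXXXX5:34:000000000-AXXXX:1:1101:15353:1403 1:N:0:TAAGGAGTAGATCG",
    "ACGT",
    "+",
    "IIII",
]
counts = tally_indexes(example)
print(counts.most_common(10))
```

For a real file, the same function can be fed an open file handle (for example from `gzip.open(path, "rt")`, where `path` is whatever the Undetermined FASTQ is called in your output folder). Tags containing N's or tags that are one base off an expected combination then stand out in the most common entries.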

**kcchan** · 12-11-2013, 01:33 PM

Originally posted by GenoMax View Post

If you (or the facility) ran the demultiplexing with the 2D barcodes then the reads that end up in the "Undetermined" file will contain the tags in the sequence ID line (the two tags will be concatenated together). See the example below. In case of bad calls there will be N's in the tags.

That might be a Casava/BCL2Fastq only thing. None of the reads we've generated using MSR have that tag at the end.

**kmcarr** · 12-11-2013, 01:47 PM

Originally posted by dsobral View Post

Hello,

I get data from the local sequencing facility, where they use Nextera and dual indexing on a MiSeq machine.

…

I tried bcl2fastq and after quite some time fighting to get it compiled, I couldn't make it work with MiSeq data: file name extensions and versions are apparently not adequate for it.

Daniel,

Bcl2Fastq v1.8.4 works perfectly well with MiSeq data, I use it all the time. What I am wondering is if you really have all that is needed to run it though. Bcl2Fastq requires the full run folder produced by the MiSeq (or HiSeq). It would be highly unusual for a sequencing facility to provide the full run folder to clients (we never do). It is simply too large and most of the data is not useful to the submitter; all they want back are the sequence files. Unless you got the full run folder from your sequencing facility there is nothing Bcl2Fastq can do for you.

Here is a link to a description of the MiSeq Run Folder.

**dsobral** · 12-11-2013, 02:02 PM

Originally posted by kcchan View Post

That might be a Casava/BCL2Fastq only thing. None of the reads we've generated using MSR have that tag at the end.

My reads were also generated using MSR and they don't have that tag at the end. (would be really nice though!)

**dsobral** · 12-11-2013, 02:03 PM

Originally posted by kmcarr View Post

Daniel,

Bcl2Fastq v1.8.4 works perfectly well with MiSeq data, I use it all the time. What I am wondering is if you really have all that is needed to run it though. Bcl2Fastq requires the full run folder produced by the MiSeq (or HiSeq). It would be highly unusual for a sequencing facility to provide the full run folder to clients (we never do). It is simply too large and most of the data is not useful to the submitter; all they want back are the sequence files. Unless you got the full run folder from your sequencing facility there is nothing Bcl2Fastq can do for you.

Here is a link to a description of the MiSeq Run Folder.

They usually only provide the fastq files, but given this issue I asked them for the full folder, so I should have all the files.

Right now the problem I'm having is that it does not seem to recognize my sample sheet file:

configureBclToFastq.pl --input-dir Data/Intensities/BaseCalls --output-dir Unaligned --positions-format .locs --no-eamss
"ERROR: Wrong number of fields in sample sheet (expected: 10, got 2: IEMFileVersion,4)"

**GenoMax** · 12-11-2013, 03:36 PM

Here is an example template samplesheet file for use with MiSeq Reporter. Make sure you save it as a "csv" (comma separated values) file.

EDIT: I am leaving this here as an example. If you are trying to manually run Bcl2fastq then you will need to use a different samplesheet. The template for that samplesheet is posted in #14 below.

Code:

[Header]
IEMFileVersion,4
Investigator Name, REPLACE
Experiment Name, EXPT
Date,12/6/2013
Workflow,GenerateFASTQ
Application,FASTQ Only
Assay,Nextera
Description,
Chemistry,

[Reads]
250
250

[Settings]
ReverseComplement,0
Adapter,CTGTCTCTTATACACATCT

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
SampleA,,,,N711,AAGAGGCA,N507,AAGGAGTA,PROJECT_NAME,
SampleB,,,,N710,CGAGGCTG,N507,AAGGAGTA,PROJECT_NAME,

**kcchan** · 12-11-2013, 04:19 PM

Originally posted by dsobral View Post

They usually only provide the fastq files, but given this issue I asked them for the full folder, so I should have all the files.

Right now the problem I'm having is that it does not seem to recognize my sample sheet file:

configureBclToFastq.pl --input-dir Data/Intensities/BaseCalls --output-dir Unaligned --positions-format .locs --no-eamss
"ERROR: Wrong number of fields in sample sheet (expected: 10, got 2: IEMFileVersion,4)"

Just to clarify what's happening: you're trying to run bcl2fastq using a MiSeq sample sheet, but bcl2fastq requires a sample sheet in a different format. The format GenoMax refers to is the older style used by CASAVA and bcl2fastq. You'll have to make your own sample sheet or install Illumina Experiment Manager 1.4 in order to set up the proper sample sheet.

**GenoMax** · 12-11-2013, 05:30 PM

Originally posted by kcchan View Post

Just to clarify what's happening: you're trying to run bcl2fastq using a MiSeq sample sheet, but bcl2fastq requires a sample sheet in a different format. The format GenoMax refers to is the older style used by CASAVA and bcl2fastq. You'll have to make your own sample sheet or install Illumina Experiment Manager 1.4 in order to set up the proper sample sheet.

Thanks for the reminder. We never use MiSeq's on-board data analysis so I forgot about the file format.

What's odd is there should be a samplesheet in the folder dsobral got if the full data folder is there.

I have amended my example above to reflect the new format.

dsobral: If you want to make a SampleSheet up using a GUI then you can use the "Illumina Experiment Manager" (v. 1.6.0) that you can download here: http://support.illumina.com/sequenci...downloads.ilmn

By using the older version of Illumina Expt Manager (v.1.4.x) one can make up CASAVA/Bcl2Fastq style samplesheets.

**kcchan** · 12-11-2013, 08:38 PM

Originally posted by GenoMax View Post

Thanks for the reminder. We never use MiSeq's on-board data analysis so I forgot about the file format.

What's odd is there should be a samplesheet in the folder dsobral got if the full data folder is there.

I have amended my example above to reflect the new format.

dsobral: If you want to make a SampleSheet up using a GUI then you can use the "Illumina Experiment Manager" (v. 1.6.0) that you can download here: http://support.illumina.com/sequenci...downloads.ilmn

What you had initially would have worked fine for dsobral's case (running Bcl2Fastq on a MiSeq run folder with a Casava style sample sheet). If you want to use the MiSeq sample sheet you'll need to run MSR locally, which is way more hassle than it's worth.

**dsobral** · 12-12-2013, 02:16 AM

Thanks for the help.

Indeed I have a SampleSheet in the input directory, which looks exactly like the one GenoMax showed. But this format does not seem to work with bcl2fastq.

I could see that the problem comes from this class:
Casava/Demultiplex/SampleSheet/Csv.pm

From what I see, this parser assumes the SampleSheet consists simply of 10-column lines (with a header), with the following columns:
"FCID","Lane","SampleID","SampleRef","Index","Description","Control","Recipe","Operator","SampleProject"

This does not seem at all compatible with the information in the SampleSheet from the MiSeq (e.g. dual indexing etc.). In any case, I tried to create a sample sheet that looked compatible with this format:
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
000000000-A501F,1,BOGUS_NAME,BOGUS_REF,AAAAAAAAAAAAAA,A01,,,,BOGUS

Notice that for it to work, FCID needs to be a specific ID (you can get it from the run folder name) and the Index needs to be 14bp (which I already found suspicious).

Now it seemed to work! It creates Fastq files with undetermined indexes and it has the index at the end of the ID line, as GenoMax showed.
BUT (it seemed too good to be true)... the index is 14bp, e.g.
@HWI-M01876:4:000000000-A501F:1:1101:15522:1333 1:N:0:TAAGGCGCTCTCTA

This does not seem to be right... We're using dual indexing with 8bp i5 and i7 indexes...
Taking this example, TAAGGCG corresponds to the first 7 bases of one index and CTCTCTA to the first 7 bases of the second... so only 1 base is missing from each index... I think I can actually recover the samples based solely on this, but it would be nice to have the complete indexes...

I'm going to try and see if I can recover the "missing sample" now.

Thanks again,
Daniel
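
Recovering samples from the 14-base tag amounts to splitting it into two 7-base halves and comparing each half with the first 7 bases of the expected indexes. A minimal Python sketch under those assumptions; the sample name and the full 8-base indexes in `samples` are illustrative assumptions (only their first 7 bases appear in the post above), and only exact matches are considered:

```python
def split_tag(tag, first_len=7):
    """Split a concatenated tag into its first and second index parts."""
    return tag[:first_len], tag[first_len:]


def assign_sample(tag, samples, trim=7):
    """Return the unique sample whose trimmed indexes match the tag, else None.

    samples maps a sample name to a (first_index, second_index) tuple of
    full-length index sequences.
    """
    first, second = split_tag(tag, trim)
    hits = [
        name
        for name, (idx1, idx2) in samples.items()
        if idx1[:trim] == first and idx2[:trim] == second
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample table; the eighth base of each index is assumed.
samples = {"missing_sample": ("TAAGGCGA", "CTCTCTAT")}

print(assign_sample("TAAGGCGCTCTCTA", samples))
print(assign_sample("TAAGGCGNNNNNNN", samples))
```

Tags with N's fall through to `None` here; tolerating a mismatch would need a distance check against all expected combinations instead of the exact comparison.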

**GenoMax** · 12-12-2013, 02:30 AM

Daniel

Good to see that you are hanging in there.

Here is the correct format for the CASAVA style samplesheet. I will leave the new format in the post above for reference. You can omit "SampleProject" (at least it works with CASAVA) or add it, as you found, as the last column in the samplesheet. It only serves to segregate your sequences into "Projects" if you have more than one in the lane.

Example for single barcodes:

Code:

FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,Operator,SampleProject
000000000-XXXXX,1,SampleA,no_ref,TAAGGCG,NA,N,NA,NA,
000000000-XXXXX,1,SampleB,no_ref,CGTACTA,NA,N,NA,NA,

Example for 2D barcodes:

Code:

FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,Operator,SampleProject
000000000-XXXXX,1,SampleA,no_ref,TAAGGCG-TAGATCG,NA,N,NA,NA,
000000000-XXXXX,1,SampleB,no_ref,CGTACTA-TAGATCG,NA,N,NA,NA,

Replace "000000000-XXXXX" with your flowcell ID (in your case 000000000-A501F). Change sample names as needed and adjust rows to suit. You will need to use (n-1), i.e. 7 bases from the 8 base tags (we can explain that later). Note the "-" separating the two tags. Save the file in "csv" format.

You will have to add "--sample-sheet /path_to_samplesheet_you_made" to your bcl2fastq command line.

The tags will look like "TAAGGCGTAGATCG" in the ID lines of the final sequence files. NOTE: The tags get concatenated in the sequence file (the "-" gets removed). That is the explanation for your 14 bp tags.
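
The conversion described in this thread (take the [Data] section of the MiSeq sheet, keep n-1 bases of each tag, join the two tags with "-") can be sketched in Python. This is only an illustration under the assumptions stated in the posts: the column names follow the 10-column header that worked for dsobral above, the filler values ("no_ref", "NA", "N") follow the examples in this post, and the function name is made up:

```python
import csv
import io

CASAVA_HEADER = ["FCID", "Lane", "SampleID", "SampleRef", "Index",
                 "Description", "Control", "Recipe", "Operator", "SampleProject"]


def miseq_to_casava(miseq_text, fcid, lane="1", trim=7):
    """Build CASAVA-style rows from the [Data] section of a MiSeq sample sheet."""
    lines = miseq_text.splitlines()
    # Everything after the [Data] line is an ordinary CSV table with a header.
    start = next(i for i, line in enumerate(lines)
                 if line.strip().startswith("[Data]"))
    rows = [CASAVA_HEADER]
    for rec in csv.DictReader(lines[start + 1:]):
        if not rec.get("Sample_ID"):
            continue
        index = rec["index"][:trim]          # n-1 bases of the first tag
        if rec.get("index2"):
            index += "-" + rec["index2"][:trim]  # "-" separates the two tags
        rows.append([fcid, lane, rec["Sample_ID"], "no_ref", index,
                     "NA", "N", "NA", "NA", rec.get("Sample_Project") or ""])
    return rows


# Shortened copy of the MiSeq Reporter example sheet from this thread.
miseq_text = """[Header]
IEMFileVersion,4

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
SampleA,,,,N711,AAGAGGCA,N507,AAGGAGTA,PROJECT_NAME,
SampleB,,,,N710,CGAGGCTG,N507,AAGGAGTA,PROJECT_NAME,
"""

rows = miseq_to_casava(miseq_text, "000000000-A501F")
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

The printed text can be saved as a csv file and passed with "--sample-sheet" as described above. Check the result against the sample sheet definition in the bcl2fastq user guide before relying on it.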

**dsobral** · 12-12-2013, 03:45 AM

OK, I have now actually read the sample sheet definition in the bcl2fastq user guide. Since I already had a sample sheet I assumed it was that one... I didn't realize at first that it was something so different.

Originally posted by GenoMax View Post

Daniel
You will need to use (n-1), i.e. 7 bases from the 8 base tags (we can explain that later). Note the "-" separating the two tags. Save the file in "csv" format.

...

The tags will look like "TAAGGCGTAGATCG" in the ID lines of the final sequence files. NOTE: The tags get concatenated in the sequence file (the "-" gets removed). That is the explanation for your 14 bp tags.

I also got the hyphen from the manual. But why n-1 bases for the indexes? Is it because of phasing calculations, as with the reads?

Miseq: how to access index base calls? bcl2fastq does not work
