SEQanswers

Old 12-11-2013, 10:52 AM   #1
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
MiSeq: how to access index base calls? bcl2fastq does not work

Hello,

I get data from the local sequencing facility, where they use Nextera and dual indexing on a MiSeq machine.

I have some issues with multiplexed samples, where sometimes a lot of sequence goes to the unassigned pool. Most often 1 sample is completely missing and the amount of sequence in the unassigned pool roughly corresponds to what I expect.

I looked in the appropriate report files (namely DemultiplexSummaryF1L1.txt), and all the expected indexes (both 1 and 2) account for the vast majority of the reads (though I'm rather surprised by the enormous number of variants in the indexes, in just 8 bases!).

Thus I don't think it is an incorrect index in the configuration file, otherwise other samples would also be affected. Since two times in a row it was the same combination of indexes that failed, I thought that might be the problem, but in subsequent runs the combination changed... so I'm left trying to figure out what is going on.

I wanted to go back and try the basecalling myself, but so far I couldn't see a way of doing it easily for MiSeq data other than with MiSeq Reporter...

I tried bcl2fastq and after quite some time fighting to get it compiled, I couldn't make it work with MiSeq data: file name extensions and versions are apparently not adequate for it.

Any ideas welcome.

Thanks,
Daniel
Old 12-11-2013, 11:16 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

I am surprised that the facility that did your sequencing is not willing to work with you on diagnosing what went wrong.

My guess is basecalling may not be the problem here but sample overclustering may be. As I posted in the other thread do you know the cluster concentration for this run ( clusters/mm^2)?

Did you extract all "tag sequences" from the "Undetermined" pool file or just browsed through some?

If there are 2 (or more) N's in one or both tags then you may need to give up on this run, since you are not going to be able to de-multiplex the sequences (you can only allow for a max of 2 errors per barcode, and that may not work all the time since it depends on the combinations of tags you have used).
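
As a rough illustration of why the usable mismatch setting depends on the tag combination: if the closest pair of barcodes differs at d positions, a read with t mismatches can only be assigned unambiguously when 2t < d. The sketch below computes that bound. The three 7-base tags are ones that appear elsewhere in this thread and are used here purely as an example set; the function names are made up for illustration, and the result says nothing about the limit any particular demultiplexer enforces.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def safe_mismatches(barcodes: list[str]) -> int:
    """Largest mismatch count t with 2*t < minimum pairwise distance."""
    min_dist = min(hamming(a, b) for a, b in combinations(barcodes, 2))
    return max((min_dist - 1) // 2, 0)


# Example set only; substitute the tags actually pooled in the run.
tags = ["TAAGGCG", "CGTACTA", "AGGCAGA"]
print(safe_mismatches(tags))
```

For dual indexing the same check can be applied to each index read separately.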
Old 12-11-2013, 11:27 AM   #3
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

Quote:
Originally Posted by GenoMax View Post
I am surprised that the facility that did your sequencing is not willing to work with you on diagnosing what went wrong.
I'm working with them to try to figure out what might have gone wrong.

Quote:
Originally Posted by GenoMax View Post
My guess is basecalling may not be the problem here but sample overclustering may be. As I posted in the other thread do you know the cluster concentration for this run ( clusters/mm^2)?
I don't know the exact numbers but I think they had a bit of a high value in a couple of runs. Overclustering affects read quality (adding more reads to the unassigned pool), but what I'm observing is that even when the sequence quality is good and cluster density is ok, one specific sample (not always the same) disappears (apparently into the unassigned pool). Both the indexes that are part of that sample are read (meaning in counts reported in the DemultiplexSummaryF1L1.txt file).

Quote:
Originally Posted by GenoMax View Post
Did you extract all "tag sequences" from the "Undetermined" pool file or just browsed through some?

If there are 2 (or more) N's in one or both tags then you may need to give up on this run, since you are not going to be able to de-multiplex the sequences (you can only allow for a max of 2 errors per barcode, and that may not work all the time since it depends on the combinations of tags you have used).
What do you mean by the "tag sequences" from the undetermined pool?
What I get is 251x251 reads (without the indexes). That's why I was trying to get bcl2fastq working, to see if I could access the index sequences and find out what's going on.

Thanks for the help,
Daniel

Last edited by dsobral; 12-11-2013 at 11:34 AM.
Old 12-11-2013, 11:53 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Quote:
Originally Posted by dsobral View Post
I don't know the exact numbers but I think they had a bit of a high value in a couple of runs. Overclustering affects read quality (adding more reads to the unassigned pool), but what I'm observing is that even when the sequence quality is good and cluster density is ok, one specific sample (not always the same) disappears (apparently into the unassigned pool). Both the indexes that are part of that sample are read (meaning in counts reported in the DemultiplexSummaryF1L1.txt file).
Let us wait to get that number. If you are running v3 kits then the number can go as high as 1300-1400 clusters/mm^2. If it is higher than that, then overclustering is your problem. Normal reads are tolerant of overclustering, but the tag reads suffer badly when samples go over a certain cluster density.

Quote:
Originally Posted by dsobral View Post

What do you mean by the "tag sequences" from the undetermined pool?
What I get is 251x251 reads (without the indexes). That's why I was trying to get bcl2fastq working, to see if I could access the index sequences and find out what's going on.

Thanks for the help,
Daniel
If you (or the facility) ran the demultiplexing with the 2D barcodes then the reads that end up in the "Undetermined" file will contain the tags in the sequence ID line (the two tags will be concatenated together). See the example below. In case of bad calls there will be N's in the tags.
Quote:
@HWI-MXXXX5:34:000000000-AXXXX:1:1101:15353:1403 1:N:0:TAAGGAGTAGATCG
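
A minimal sketch of pulling the tag out of an ID line like the one above and counting its N's. It assumes the header has the layout shown in the example (a space, then four colon-separated fields with the tag last); `index_tag` is a hypothetical helper name, and headers without that trailing field (such as the MSR output discussed in this thread) would need different handling.

```python
def index_tag(header: str) -> str:
    """Return the index field from a FASTQ ID line of the form
    '@<read id> <read>:<filtered>:<control>:<index>'."""
    comment = header.strip().split(" ", 1)[1]  # part after the first space
    return comment.split(":")[3]               # fourth colon-separated field


header = "@HWI-MXXXX5:34:000000000-AXXXX:1:1101:15353:1403 1:N:0:TAAGGAGTAGATCG"
tag = index_tag(header)
print(tag, "N count:", tag.count("N"))
```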

Last edited by GenoMax; 12-11-2013 at 11:59 AM.
Old 12-11-2013, 12:33 PM   #5
kcchan
Senior Member
 
Location: USA

Join Date: Jul 2012
Posts: 182
Default

Quote:
Originally Posted by GenoMax View Post
If you (or the facility) ran the demultiplexing with the 2D barcodes then the reads that end up in the "Undetermined" file will contain the tags in the sequence ID line (the two tags will be concatenated together). See the example below. In case of bad calls there will be N's in the tags.
That might be a Casava/BCL2Fastq only thing. None of the reads we've generated using MSR have that tag at the end.
Old 12-11-2013, 12:47 PM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169
Default

Quote:
Originally Posted by dsobral View Post
Hello,

I get data from the local sequencing facility, where they use Nextera and dual indexing on a MiSeq machine.



I tried bcl2fastq and after quite some time fighting to get it compiled, I couldn't make it work with MiSeq data: file name extensions and versions are apparently not adequate for it.
Daniel,

Bcl2Fastq v1.8.4 works perfectly well with MiSeq data, I use it all the time. What I am wondering is if you really have all that is needed to run it though. Bcl2Fastq requires the full run folder produced by the MiSeq (or HiSeq). It would be highly unusual for a sequencing facility to provide the full run folder to clients (we never do). It is simply too large and most of the data is not useful to the submitter; all they want back are the sequence files. Unless you got the full run folder from your sequencing facility there is nothing Bcl2Fastq can do for you.

Here is a link to a description of the MiSeq Run Folder.
Old 12-11-2013, 01:02 PM   #7
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

Quote:
Originally Posted by kcchan View Post
That might be a Casava/BCL2Fastq only thing. None of the reads we've generated using MSR have that tag at the end.
My reads were also generated using MSR and they don't have that tag at the end. (would be really nice though!)
Old 12-11-2013, 01:03 PM   #8
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

Quote:
Originally Posted by kmcarr View Post
Daniel,

Bcl2Fastq v1.8.4 works perfectly well with MiSeq data, I use it all the time. What I am wondering is if you really have all that is needed to run it though. Bcl2Fastq requires the full run folder produced by the MiSeq (or HiSeq). It would be highly unusual for a sequencing facility to provide the full run folder to clients (we never do). It is simply too large and most of the data is not useful to the submitter; all they want back are the sequence files. Unless you got the full run folder from your sequencing facility there is nothing Bcl2Fastq can do for you.

Here is a link to a description of the MiSeq Run Folder.
They usually only provide the fastq files, but given this issue, I asked them for the full folder, so I should have all the files.

Right now the problem I'm having when running it is that it does not seem to recognize my sample sheet file:

configureBclToFastq.pl --input-dir Data/Intensities/BaseCalls --output-dir Unaligned --positions-format .locs --no-eamss
"ERROR: Wrong number of fields in sample sheet (expected: 10, got 2: IEMFileVersion,4)"

Last edited by dsobral; 12-11-2013 at 01:07 PM.
Old 12-11-2013, 02:36 PM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Here is an example template samplesheet file for use with MiSeq Reporter. Make sure you save it as a "csv" (comma-separated values) file.

EDIT: I am leaving this here as an example. If you are trying to manually run Bcl2fastq then you will need to use a different samplesheet. The template for that samplesheet is posted in #14 below.

Code:
[Header]
IEMFileVersion,4
Investigator Name, REPLACE
Experiment Name, EXPT
Date,12/6/2013
Workflow,GenerateFASTQ
Application,FASTQ Only
Assay,Nextera
Description,
Chemistry,

[Reads]
250
250

[Settings]
ReverseComplement,0
Adapter,CTGTCTCTTATACACATCT

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
SampleA,,,,N711,AAGAGGCA,N507,AAGGAGTA,PROJECT_NAME,
SampleB,,,,N710,CGAGGCTG,N507,AAGGAGTA,PROJECT_NAME,

Last edited by GenoMax; 12-12-2013 at 03:35 AM. Reason: Clarification of two samplesheet formats
Old 12-11-2013, 03:19 PM   #10
kcchan
Senior Member
 
Location: USA

Join Date: Jul 2012
Posts: 182
Default

Quote:
Originally Posted by dsobral View Post
They usually only provide the fastq, but given this issue, I asked them the full folder, so I should have all the files.

Right now the problem I'm having when running is that it does not seem to recognize my Sample sheet file:

configureBclToFastq.pl --input-dir Data/Intensities/BaseCalls --output-dir Unaligned --positions-format .locs --no-eamss
"ERROR: Wrong number of fields in sample sheet (expected: 10, got 2: IEMFileVersion,4)"
Just to clarify what's happening, you're trying to run bcl2fastq using a MiSeq sample sheet. However, the sample sheet required is in a different format. The format GenoMax refers to is the older style used by CASAVA and bcl2fastq. You'll have to make your own sample sheet or install Illumina Experiment manager 1.4 in order to set up the proper sample sheet.
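
To make the difference concrete, here is a small sketch that guesses which of the two formats a sample sheet is in before handing it to anything. It relies only on what is described in this thread: the MiSeq sheet is divided into bracketed sections such as [Header], while the CASAVA/bcl2fastq sheet is a flat table with 10 comma-separated fields per line. The function name and the returned labels are made up for illustration.

```python
def sample_sheet_style(text: str) -> str:
    """Guess the sample sheet flavour from its first non-blank line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("["):
            return "miseq"    # sectioned file: [Header], [Reads], [Data], ...
        if len(line.split(",")) == 10:
            return "casava"   # flat table, 10 fields per line
        return "unknown"
    return "unknown"


print(sample_sheet_style("[Header]\nIEMFileVersion,4\n"))
```

A check like this would report "miseq" for the sheet that triggered the "expected: 10, got 2" error above.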
Old 12-11-2013, 04:30 PM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Quote:
Originally Posted by kcchan View Post
Just to clarify what's happening, you're trying to run bcl2fastq using a MiSeq sample sheet. However, the sample sheet required is in a different format. The format GenoMax refers to is the older style used by CASAVA and bcl2fastq. You'll have to make your own sample sheet or install Illumina Experiment manager 1.4 in order to set up the proper sample sheet.
Thanks for the reminder. We never use MiSeq's on-board data analysis so I forgot about the file format.

What's odd is there should be a samplesheet in the folder dsobral got if the full data folder is there.

I have amended my example above to reflect the new format.

dsobral: If you want to make a SampleSheet up using a GUI then you can use the "Illumina Experiment Manager" (v. 1.6.0) that you can download here: http://support.illumina.com/sequenci...downloads.ilmn

By using the older version of Illumina Expt Manager (v.1.4.x) one can make up CASAVA/Bcl2Fastq style samplesheets.

Last edited by GenoMax; 12-12-2013 at 01:35 AM. Reason: Added info about other version of ILMN EXPT MANAGER
Old 12-11-2013, 07:38 PM   #12
kcchan
Senior Member
 
Location: USA

Join Date: Jul 2012
Posts: 182
Default

Quote:
Originally Posted by GenoMax View Post
Thanks for the reminder. We never use MiSeq's on-board data analysis so I forgot about the file format.

What's odd is there should be a samplesheet in the folder dsobral got if the full data folder is there.

I have amended my example above to reflect the new format.

dsobral: If you want to make a SampleSheet up using a GUI then you can use the "Illumina Experiment Manager" (v. 1.6.0) that you can download here: http://support.illumina.com/sequenci...downloads.ilmn
What you had initially would have worked fine for dsobral's case (running Bcl2Fastq on a MiSeq run folder with a Casava style sample sheet). If you want to use the MiSeq sample sheet you'll need to run MSR locally, which is way more hassle than it's worth.
Old 12-12-2013, 01:16 AM   #13
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

Thanks for the help.

Indeed I have a SampleSheet in the input directory, which looks exactly like the one GenoMax showed. But this format does not seem to work with bcl2fastq.

I could see that the problem comes from this class:
Casava/Demultiplex/SampleSheet/Csv.pm

From what I see this parser assumes the SampleSheet to be composed simply of 10 column lines (with a header), with the following composition:
"FCID","Lane","SampleID","SampleRef","Index","Description","Control","Recipe","Operator","SampleProject"

This does not seem at all compatible with the information in the SampleSheet from the MiSeq (e.g. dual indexing etc.). In any case I tried to create a sample sheet that looked compatible with this format:
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
000000000-A501F,1,BOGUS_NAME,BOGUS_REF,AAAAAAAAAAAAAA,A01,,,,BOGUS

Notice that for it to work, FCID needs to be a specific ID (you can get it from the run folder name) and the Index needs to be 14bp (which I already found suspicious).

Now it seemed to work! It creates FASTQ files with undetermined indexes, and each read has the index at the end of its ID line, as GenoMax showed.
BUT (it seemed too good to be true)... the index is 14bp, e.g.
@HWI-M01876:4:000000000-A501F:1:1101:15522:1333 1:N:0:TAAGGCGCTCTCTA

This does not seem to be right... We're using dual indexing 8bp i5 and i7...
Taking this example, TAAGGCG corresponds to the first 7 bases of one index and CTCTCTA are the 7 bases of the second... so only 1 base is missing from each index... I think I can actually recover the samples based solely on this, but would be nice to have the complete indexes...
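
A sketch of the recovery idea described here: split the 14-base tag into two 7-base halves and look up each half as a prefix of the known 8-base indexes. The index sequences are ones quoted in this thread; the labels "missing_i7" and "missing_i5" and the example tag are hypothetical, and the approach assumes the 7 reported bases really are the first 7 of each index.

```python
def split_tag(tag: str, first_len: int = 7) -> tuple[str, str]:
    """Split a concatenated dual-index tag into its two halves."""
    return tag[:first_len], tag[first_len:]


def match_prefix(fragment: str, indexes: dict[str, str]) -> list[str]:
    """Names of all indexes whose sequence starts with the fragment."""
    return [name for name, seq in indexes.items() if seq.startswith(fragment)]


# Lookup tables built from sequences mentioned in the thread (labels partly made up).
i7 = {"N711": "AAGAGGCA", "N710": "CGAGGCTG", "missing_i7": "TAGGCATG"}
i5 = {"N507": "AAGGAGTA", "missing_i5": "TAGATCGC"}

first, second = split_tag("TAGGCATTAGATCG")  # hypothetical 14-base tag
print(match_prefix(first, i7), match_prefix(second, i5))
```

If a fragment matches more than one index, the two cannot be told apart from 7 bases alone.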

I'm going to try and see if I can recover the "missing sample" now.

Thanks again,
Daniel
Old 12-12-2013, 01:30 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Daniel

Good to see that you are hanging in there.

Here is the correct format for the CASAVA style samplesheet. I will leave the new format in on the post above for reference. You can omit "SampleProject" (at least it works with CASAVA) or add it as you found to be the last column in the samplesheet. That only works to segregate your sequences into "Projects" if you have more than one in the lane.

Example for single barcodes:
Code:
FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,Operator,SampleProject
000000000-XXXXX,1,SampleA,no_ref,TAAGGCG,NA,N,NA,NA,
000000000-XXXXX,1,SampleB,no_ref,CGTACTA,NA,N,NA,NA,
Example for 2D barcodes:
Code:
FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,Operator,SampleProject
000000000-XXXXX,1,SampleA,no_ref,TAAGGCG-TAGATCG,NA,N,NA,NA,
000000000-XXXXX,1,SampleB,no_ref,CGTACTA-TAGATCG,NA,N,NA,NA,
Replace "000000000-XXXXX" with your flowcell ID (In your case 000000000-A501F). Change Sample names as needed and adjust rows to suit. You will need to use (n-1) i.e. 7 bases from the 8 base tags (we can explain that later). Note the "-" separating the two tags. Save the file as "csv" format.

You will have to add "--sample-sheet /path_to_samplesheet_you_made" to your bcl2fastq command line.

The tags will look like "TAAGGCGTAGATCG" in the final sequence files ID's. NOTE: The tags get concatenated in the sequence file ("-" gets removed). That is the explanation for your 14 bp tags.
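
For runs with many samples, the conversion can be scripted rather than typed by hand. The following is only a sketch under stated assumptions: it reads the [Data] section of a MiSeq-style sheet (column names as in the MiSeq Reporter template posted earlier in this thread), keeps the first 7 bases of each index, joins dual indexes with "-", and fills the remaining columns with the placeholder values from the examples above. The output header uses the column names dsobral quoted from the parser. The function name is made up, and it assumes index2 is already written in the orientation the instrument reads it.

```python
import csv
import io

CASAVA_HEADER = ["FCID", "Lane", "SampleID", "SampleRef", "Index",
                 "Description", "Control", "Recipe", "Operator", "SampleProject"]


def miseq_to_casava(miseq_text: str, fcid: str, lane: int = 1, keep: int = 7) -> str:
    """Convert the [Data] section of a MiSeq sample sheet to CASAVA-style CSV."""
    lines = miseq_text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.strip().startswith("[Data]"):
            start = i + 1  # the column header follows the section marker
            break
    if start is None:
        raise ValueError("no [Data] section found")

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(CASAVA_HEADER)
    for row in csv.DictReader(lines[start:]):
        if not row.get("Sample_ID"):
            continue
        index = row["index"][:keep]          # n-1 bases of the first tag
        if row.get("index2"):
            index += "-" + row["index2"][:keep]
        writer.writerow([fcid, lane, row["Sample_ID"], "no_ref", index,
                         "NA", "N", "NA", "NA", row.get("Sample_Project") or ""])
    return out.getvalue()


example = """[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
SampleA,,,,N711,AAGAGGCA,N507,AAGGAGTA,PROJECT_NAME,
"""
print(miseq_to_casava(example, "000000000-XXXXX"))
```

The resulting text can be saved as a csv file and passed with "--sample-sheet" as described above.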

Last edited by GenoMax; 02-25-2015 at 08:10 AM. Reason: Added samplesheet example for single barcodes
Old 12-12-2013, 02:45 AM   #15
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

OK, I have now actually read the sample sheet definition in the bcl2fastq user guide. Since I already had a sample sheet I assumed it was that one... I didn't realize at first that it was something so different.

Quote:
Originally Posted by GenoMax View Post
Daniel
You will need to use (n-1) i.e. 7 bases from the 8 base tags (we can explain that later). Note the "-" separating the two tags. Save the file as "csv" format.

...

The tags will look like "TAAGGCGTAGATCG" in the final sequence files ID's. NOTE: The tags get concatenated in the sequence file ("-" gets removed). That is the explanation for your 14 bp tags.
I also got the hyphen from the manual. But why is this n-1 in the indexes? Is it because of phasing calculations like in the read?
Old 12-12-2013, 03:32 AM   #16
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

MiSeq was designed to have minimal "user serviceable" parts (including the software to run it) and this is a side effect of that convenience. In the process of simplifying the user interface they had to change several things. For most users this is the preferred solution, so in general it works fine. Only in cases such as yours is additional work sometimes needed.

Since there is no phasing information available for the last base in the tag, that base is not considered during demultiplexing. One can use "--use-bases-mask" creatively, but using a samplesheet with (n-1) bases works.
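
For anyone who prefers the mask route, the sketch below builds a mask string for a run like this one. It assumes the mask syntax as I understand it from the bcl2fastq 1.8 user guide (Y = use the cycle, I = use as index, n = ignore), so the guide should be checked before relying on it. The cycle counts (251, 8, 8, 251) are taken from the read and index lengths mentioned in this thread, and the helper names are hypothetical.

```python
def index_mask(cycles: int, keep: int) -> str:
    """Mask for one index read: use `keep` bases, ignore the remaining cycles."""
    if not 0 < keep <= cycles:
        raise ValueError("keep must be between 1 and the number of cycles")
    return f"I{keep}" + "n" * (cycles - keep)


def use_bases_mask(read1: int, index_cycles: list[int], read2: int, keep: int) -> str:
    """Join per-read masks in sequencing order: read 1, index reads, read 2."""
    parts = [f"Y{read1}"]
    parts += [index_mask(c, keep) for c in index_cycles]
    parts.append(f"Y{read2}")
    return ",".join(parts)


print(use_bases_mask(251, [8, 8], 251, keep=7))
```

With keep=7 this expresses the same (n-1) choice as the shortened tags in the samplesheet.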

I assume you are re-running the de-multiplexing with the correct samplesheet and will let us know what happens.
Old 12-12-2013, 07:01 AM   #17
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

Quote:
Originally Posted by GenoMax View Post

I assume you are re-running the de-multiplexing with the correct samplesheet and will let us know what happens.
I just ran bcl2fastq on the full run with the correct samplesheet.

The results show the same pattern as before: there is one "missing" sample with (almost) no data (1500-2000 reads playing with the mismatch parameter).

Ok, so I think I'm reassured the MiSeq software is doing a good (or at least decent) job. Now I'm trying to dig into the data to try and see if I can find out what may have happened... (fingers crossed). I'll also ask the sequencing facility about the cluster density: but would it affect one sample only?

Thanks a lot,
Daniel
Old 12-12-2013, 07:28 AM   #18
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Quote:
Originally Posted by dsobral View Post
The results show the same pattern as before: there is one "missing" sample with (almost) no data (1500-2000 reads playing with the mismatch parameter).

Ok, so I think I'm reassured the MiSeq software is doing a good (or at least decent) job. Now I'm trying to dig into the data to try and see if I can find out what may have happened... (fingers crossed). I'll also ask the sequencing facility about the cluster density: but would it affect one sample only?

Thanks a lot,
Daniel
That would almost certainly indicate that you either have a bad library for that sample (or something happened during pooling that caused that sample to drop out).

What fraction of sequences is ending up in the "Undetermined" file, and how many total reads did you get (that should give us an idea of cluster density)?
Old 12-12-2013, 07:29 AM   #19
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

FYI, just a note on what I got so far:
The indexes from the "missing" sample are "TAGGCATG" and "TAGATCGC" (TAGGCAT-TAGATCG in the sample sheet).

Looking at the most abundant indexes in the unassigned sequences I don't see any remnants of this combination:
61972 AGGCAGATATCTCG
38880 AGGCAGATATNTCG
22973 CGTACTATATCTCG
15575 GGACTCCTATCTCG
14984 CGTACTATATNTCG
12896 TAAGGCGTATCTCG
11044 TCCTGAGTATCTCG
10075 GGACTCCTATNTCG
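
A tally like the one above can be produced with a few lines of Python. This is a sketch, not necessarily how the counts in this post were generated: it assumes strict 4-line FASTQ records whose ID line ends in ":<tag>", and the file name in the usage note is hypothetical.

```python
from collections import Counter
from typing import Iterable


def tally_tags(fastq_lines: Iterable[str]) -> Counter:
    """Count the trailing index field of every record header in a FASTQ stream."""
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # the header is the first line of each 4-line record
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


records = [
    "@r1 1:N:0:AGGCAGATATCTCG", "ACGT", "+", "IIII",
    "@r2 1:N:0:AGGCAGATATCTCG", "ACGT", "+", "IIII",
    "@r3 1:N:0:CGTACTATATNTCG", "ACGT", "+", "IIII",
]
for tag, n in tally_tags(records).most_common():
    print(n, tag)
```

For a real file, something like `with gzip.open("Undetermined_R1.fastq.gz", "rt") as fh: tally_tags(fh).most_common(8)` would give the most abundant tags.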

Well, my hopes of recovering the data this way are gone...
But at least it reassured me that it shouldn't be anything related to basecalling etc...
And I learned lots of potentially useful things (I hope)!
Old 12-12-2013, 07:46 AM   #20
dsobral
Member
 
Location: Lisbon, Portugal

Join Date: Jan 2012
Posts: 21
Default

Quote:
Originally Posted by GenoMax View Post
That would almost certainly indicate that you either have a bad library for that sample (or something happened during pooling that caused that sample to drop out).
I'm inclined to think something like that happened. I'll try to see with the sequencing folks...

Quote:
Originally Posted by GenoMax View Post
What fraction of sequences is ending up in the "Undetermined" file and how many total read you got (that should give us an idea of cluster density)?
The fraction that went to undetermined is roughly 5-10% (depending on mismatches in the index). What led me to think that it might be something in base-calling was that this amount fit the expected size of the missing sample. I had 24 samples, so I expected ~5% of the sequence for it...
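
The arithmetic behind that estimate, as a tiny sketch. It assumes the samples were pooled in equal amounts, which is an idealisation; `expected_share` is just an illustrative name.

```python
def expected_share(n_samples: int) -> float:
    """Fraction of reads one sample should receive in an evenly pooled run."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return 1.0 / n_samples


share = expected_share(24)
print(f"{share:.1%}")  # 4.2%
```

So one missing sample out of 24 accounts for roughly 4% of the reads, at the low end of the 5-10% seen in the undetermined pool.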

Thanks
All times are GMT -8.