SEQanswers

Old 12-19-2014, 02:48 AM   #1
PopGenTech
Junior Member
 
Location: Cambridge, UK

Join Date: Dec 2014
Posts: 4
bcl2fastq and index length

Hi All,

I'm trying to convert BCL files to FASTQ while preserving the index sequences in the read identifier line.
I followed this guide, http://seqanswers.com/forums/showthread.php?t=39153, and tips from GenoMax
got me to where I needed to be. However, like dsobral, I am curious why 8 bp dual indexing (16 bp of I7-I5) ends up
as a 14 bp barcode in the output (see the tag at the end of the identifier line below).

Please understand that we do not produce our own sequencing data, and the files I am working with have been obtained with
minimal information about the processes involved. I understand that in both cases the sequencing centers uploaded data to BaseSpace
directly following 'typical MiSeq runs'.

(vex)[ir210@beast Sample_lane1]$ ls
lane1_Undetermined_L001_R1_001.fastq.gz lane1_Undetermined_L001_R2_001.fastq.gz SampleSheet.csv
(vex)[ir210@beast Sample_lane1]$ zcat lane1_Undetermined_L001_R1_001.fastq.gz |head -n4
@MISEQ:30:000000000-AB55B:1:1101:15923:1332 1:N:0:GACCGATGATGCTG
AGGTCTCAGTGGCATGATCATACTTCATTATAGCCTCCAACTCCCTGGGTCAAGCAATCCTTCCACCTCAGCCTTCTAAGTAGCTGGGACTACAGGCGTGCACTACCAGACACTACCTGTCTCTTATACACATCTCCGAGCCCACGAGACG
+
BCCBCFFFFFFFGGGGGGGGGGHHHHGHHHHHHGHHHHHHHHHHHHHHGHHHHHHHHIIHHHHHHHHHHHHHHHHHHHHHHHGHGHGGHHHHHHHHGGGGGGHHHHHFHHGGHGHHHHHHHHHHHFHHHHHHHHHHHGGGGGGHGGFGGGD


What I don't understand is how the MiSeq produces 'lost reads' that have correctly formatted 8 bp indexes in the identifier line.
Here is a lost read generated automatically by the MiSeq. How did it determine the identity of the 8th position base if there are phasing issues?

(vex)[ir210@beast SLX-7061.000000000-AA0WP]$ zcat SLX-7061.000000000-AA0WP.s_1.r_1.lostreads.fq.gz|head -n4
@M01686:136:1:1101:15921:1413#CGACTCCT#TGTGTAGA
TCTCAGTTCCTCTATTTTTGTTCTATCCTGCCCTATTTCTAAGTCAGATCCTACATACAAATCATCCACCTATTGATTGCTCCCTACTGTCTCTTATACACATCTCCCTCCCCACGAGACGCCCTCCTCTCTCTTCTCCCGTCTTCTTCTTCTCCACCACACTCTCTTCCCTTCCCTCTTCTTCCTTCCTCCTCTTCCCCCCCCCCCCCCCTTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTCCCCCCTCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+
@BBCCGFGDE@CF,CCCCE+C;;C,;,;<,C6CE,;6<C#####################################################################################################################################################################################################################################################################
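
For anyone comparing the two header styles shown above, here is a minimal Python sketch that pulls the index string out of either one and reports its length. The function name `index_from_header` is hypothetical, and the parsing rules are assumptions based only on the two identifier lines quoted in this post (a colon-separated comment ending in the index, and a '#'-delimited pair of indexes).

```python
def index_from_header(header: str) -> str:
    """Return the index (barcode) string carried in a FASTQ identifier line."""
    header = header.strip()
    if "#" in header:
        # '#'-delimited style: @<read id>#<index 1>#<index 2>
        return "".join(header.split("#")[1:])
    # Colon-delimited style: the index is the last field of the comment,
    # i.e. the part after the first space (e.g. "1:N:0:GACCGATGATGCTG").
    comment = header.split(" ", 1)[1]
    return comment.split(":")[-1]


# The two identifier lines quoted above.
bcl2fastq_header = "@MISEQ:30:000000000-AB55B:1:1101:15923:1332 1:N:0:GACCGATGATGCTG"
msr_header = "@M01686:136:1:1101:15921:1413#CGACTCCT#TGTGTAGA"

for h in (bcl2fastq_header, msr_header):
    idx = index_from_header(h)
    print(idx, len(idx))
```

Run on the two headers above, this reports 14 bases for the bcl2fastq read and 16 for the lost read, which is the discrepancy in question.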


Any information regarding the automated Illumina MiSeq demuxing path compared with manual bcl2fastq processing would be
gratefully received, especially if you can fill me in on how the 8th position base in the index is actually used.

Thanks!

Last edited by PopGenTech; 12-19-2014 at 02:52 AM.
Old 12-19-2014, 04:36 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Just to clarify: was the data in example 1 above produced with bcl2fastq and example 2 with MiSeq Reporter?

On-board demultiplexing on the MiSeq is able to keep all bases of the tag reads. It also does 1-error demultiplexing by default on tag reads (and this can't be turned off).

If the two tag reads can't be correctly assigned, based on the sample sheet you are providing to bcl2fastq, they will end up in the fastq header as a concatenated string.

With standard barcodes, most of the discrimination happens in the first 5-7 bases of the tag, so the 8th position is not that critical.
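
To illustrate what 1-error demultiplexing means in practice, here is a rough Python sketch. It is not the MiSeq Reporter algorithm, only the general idea of accepting a tag that differs from exactly one expected barcode by at most one base. The sample names and the `sheet` dictionary are hypothetical; the two barcodes are simply sequences that appear elsewhere in this thread.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(observed: str, sheet: dict, max_mismatches: int = 1):
    """Return the single sample whose barcode is within max_mismatches of the
    observed tag, or None if no sample (or more than one) qualifies."""
    hits = [
        name
        for name, barcode in sheet.items()
        if len(barcode) == len(observed)
        and hamming(barcode, observed) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet: the names are made up for illustration.
sheet = {"sample_A": "CGACTCCT", "sample_B": "TAAGGCTC"}

print(assign_sample("CGACTCCT", sheet))  # exact match
print(assign_sample("CGACTCCA", sheet))  # one error in the 8th position
print(assign_sample("GGGGGGGG", sheet))  # matches nothing

# Distance between the two barcodes using only their first 7 bases.
print(hamming(sheet["sample_A"][:7], sheet["sample_B"][:7]))
```

The last line shows why the 8th base matters little here: these two barcodes are already well separated on their first 7 positions.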
GenoMax is offline   Reply With Quote
Old 12-19-2014, 05:00 AM   #3
PopGenTech
Junior Member
 
Location: Cambridge, UK

Join Date: Dec 2014
Posts: 4

Thanks GenoMax,

Yes, that is correct: the top example is bcl2fastq, the bottom is MSR. Note that the reads are unrelated examples from different runs/libraries.

"On-board demultiplexing on MiSeq is able to keep all bases on the tag reads." is what I'm trying to understand.

From your previous explanation, and the fact that the last position of the index read isn't phased, I understand that there is no data to identify the 8th position base. Consequently, the only way to deduce the full 8 bp of a tag is informatically. However, if the sample sheet is purposefully fake and the sequencer software doesn't have a lookup table of indexes, how can it impute the correct tag? This is why the bcl2fastq read ID lines have a 14 bp tag, despite the 16 bp of index sequence.

"If the two tag reads can't be correctly assigned, based on the sample sheet you are providing to bcl2fastq, they will end up in the fastq header as a concatenated string."

Yes, as in the lostreads file - but how does the MSR manage to construct a 16 bp tag given the same fake SampleSheet.CSV? What information was used to impute the identity of the 8th position base, or should it really be treated as an N?

Is it that on-board demuxing can use sequence run information not available to bcl2fastq in post run analysis, or is it error correction / imputation using a separate algorithm?

Thank you for your kind help and explanations of the process.
Old 12-19-2014, 05:58 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Calls for the last base are there. They are not being imputed. Bcl2fastq ignores the call, whereas the on-board software keeps it. The instrument is going to sequence as it was set up. If you absolutely need "n" bases and are planning to use bcl2fastq, then it is better to set the run up as n+1 cycles.
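
The n+1 advice comes down to simple arithmetic. Here is a rough Python sketch, assuming (as described above) that the default handling drops the final cycle of every read. `mask_for` is a hypothetical helper and not part of bcl2fastq; the 151-cycle read length is also an assumption, inferred from the y150n mask quoted elsewhere in this thread.

```python
def usable_bases(cycles: int, drop_last: bool = True) -> int:
    """Bases kept from a read of `cycles` cycles if the last call is ignored."""
    return cycles - 1 if drop_last and cycles > 0 else cycles


def mask_for(reads, drop_last: bool = True) -> str:
    """Build a bases-mask-style string for a list of (cycles, is_index) reads.

    'y' marks kept bases of a sequence read, 'I' kept bases of an index read,
    and 'n' an ignored cycle. This is only an illustration of the arithmetic.
    """
    parts = []
    for cycles, is_index in reads:
        kept = usable_bases(cycles, drop_last)
        letter = "I" if is_index else "y"
        parts.append(f"{letter}{kept}" + "n" * (cycles - kept))
    return ",".join(parts)


# Run set up with 8 index cycles: only 7 + 7 = 14 index bases survive.
print(mask_for([(151, False), (8, True), (8, True), (151, False)]))

# Run set up with n+1 = 9 index cycles: the full 8 + 8 = 16 bases survive.
print(mask_for([(151, False), (9, True), (9, True), (151, False)]))
```

Under that assumption, an 8-cycle index read yields 7 usable bases per index, which matches the 14 bp tag seen in the bcl2fastq output.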
GenoMax is offline   Reply With Quote
Old 12-19-2014, 06:03 AM   #5
PopGenTech
Junior Member
 
Location: Cambridge, UK

Join Date: Dec 2014
Posts: 4

Thanks GenoMax, that's what I needed to know.
Kind regards.
Old 12-19-2014, 08:02 AM   #6
PopGenTech
Junior Member
 
Location: Cambridge, UK

Join Date: Dec 2014
Posts: 4

Final word: Thanks to CRI UK for pointing me in this direction:

Explicitly state the bases mask with --use-bases-mask to override config.xml:

#configureBclToFastq.pl --input-dir test_demux/Data/Intensities/BaseCalls --output-dir test_output --sample-sheet test_demux/Data/Intensities/SampleSheet.csv --use-bases-mask y150n,I8,I8,y150n --no-eamss --fastq-cluster-count 0

This gets the full 2x8bp of the index in the output as shown:

(vex)[ir210@beast Sample_lane1]$ zgrep '^@' lane1_Undetermined_L001_R1_001.fastq.gz |head -n10|cut -d: -f10|sort|uniq -c|sort -nr

4 TAAGGCTCTAAGGCTC
2 CTAGTCAGATTCCGAG
1 TACATGAGTCTTTCCC
1 GCCTTAGACGATTGAC
1 GACTAGCTGGCATTGT
1 GACCGATTGATGCTGT
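
For anyone who prefers to do the same tally in Python, here is a minimal sketch. It assumes a well-formed gzipped FASTQ with exactly four lines per record and the index as the last colon-separated field of the header; `count_indexes` is a hypothetical helper name. Unlike grepping for '^@', it takes every fourth line, so quality strings that happen to start with '@' are not counted as headers.

```python
import gzip
from collections import Counter
from itertools import islice


def count_indexes(path: str, max_reads: int = 10):
    """Tally the index strings of the first `max_reads` records in a
    gzipped FASTQ file, most frequent first."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        # Line 1 of every 4-line record is the identifier line.
        headers = islice(handle, 0, None, 4)
        for header in islice(headers, max_reads):
            # The index is whatever follows the last colon.
            counts[header.strip().rsplit(":", 1)[-1]] += 1
    return counts.most_common()
```

Calling `count_indexes("lane1_Undetermined_L001_R1_001.fastq.gz")` returns a list of (index, count) pairs; checking that every index is 16 bases long is then a one-line loop.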

Tags
bcl2fastq, demultiplexing, index, miseq
