SEQanswers

Old 12-18-2014, 10:00 AM   #1
NitaC
Member
 
Location: Philadelphia

Join Date: Apr 2013
Posts: 17
Undetermined.fastq file

Hello,

This may be a very elementary question but since what I have found thus far on the internet has not entirely clarified this for me, I figured I'd ask here.

When a sequencing experiment is run on an Illumina platform, after demultiplexing, there are always *_Undetermined.fastq.gz files. I am lost as to why exactly some reads end up in there, and what the purpose of this file is. I've read that one may sometimes use this file to look at index frequencies or for other troubleshooting, but again, I am not entirely clear on this. Is this file strictly for troubleshooting (i.e. the reads in it will never be used in any downstream analysis)?

Thanks in advance for any help on this.
Old 12-18-2014, 11:42 AM   #2
Bukowski
Senior Member
 
Location: Aberdeen, Scotland

Join Date: Jan 2010
Posts: 340

I think it's just the reads where it has not been possible to demultiplex on the barcode with sufficient accuracy. There's always some data here even if your sample sheet is set up properly.
Old 12-18-2014, 12:03 PM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

That's correct. Undetermined is also where PhiX reads are supposed to end up if it was spiked in.
Old 12-18-2014, 02:38 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

There are some special circumstances when I deliberately want the reads to go into the "undetermined" file (when using CASAVA or bcl2fastq to demultiplex). This preserves the tags in the read IDs. We have built a demultiplexer for QIIME that can use this undetermined file to produce sample files in the QIIME format. (Sending all reads to the "undetermined" file is achieved by including a dummy tag sequence like YYYY in the sample sheet.)
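
Since the tags stay in the read IDs, one way to see which index sequences landed in the undetermined file is to tally the last field of each read header. The sketch below is only an illustration: it assumes CASAVA 1.8+/bcl2fastq style headers, where the index is the final colon-separated field, and the function names and example reads are made up.

```python
import gzip
from collections import Counter
from itertools import islice


def count_indexes(lines):
    """Tally the index field of every FASTQ header (every 4th line)."""
    counts = Counter()
    for header in islice(lines, 0, None, 4):
        # Assumed header layout:
        # @instrument:run:flowcell:lane:tile:x:y read:filtered:control:INDEX
        index = header.rstrip().rsplit(":", 1)[-1]
        counts[index] += 1
    return counts


def count_indexes_in_file(path):
    """Same tally for a gzipped FASTQ file (path is supplied by the caller)."""
    with gzip.open(path, "rt") as handle:
        return count_indexes(handle)


# Made-up reads standing in for an Undetermined FASTQ file.
example_lines = [
    "@M00001:1:000000000-A1B2C:1:1101:15000:1300 1:N:0:ACGTACGT\n",
    "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n",
    "@M00001:1:000000000-A1B2C:1:1101:15001:1301 1:N:0:ACGTACGT\n",
    "TTGCATTGCA\n", "+\n", "IIIIIIIIII\n",
    "@M00001:1:000000000-A1B2C:1:1101:15002:1302 1:N:0:TTTTTTTT\n",
    "GGGGCCCCAA\n", "+\n", "IIIIIIIIII\n",
]
example_counts = count_indexes(example_lines)
for index, n in example_counts.most_common():
    print(index, n)
```

An index that dominates this tally but is missing from the sample sheet usually points to a sample sheet error rather than to poor index reads.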
Old 12-18-2014, 05:45 PM   #5
ScottC
Senior Member
 
Location: Monash University, Melbourne, Australia.

Join Date: Jan 2008
Posts: 246

Yes, people use it for a variety of purposes, but its actual intended purpose is simply as a catch-all for any read that could not be assigned to a sample for any reason: poor quality, incorrect indexes specified in the sample sheet, missing index sequences (e.g. PhiX reads, which have no index), sequencing errors in the index read for that sequence, etc.
Old 03-01-2017, 09:27 PM   #6
bvanga
Junior Member
 
Location: New Zealand

Join Date: Sep 2016
Posts: 2
Undetermined sequence

Hi,

Can anyone please explain how the MiSeq (300 bp paired-end sequencing) decides which sequences are undetermined? I am studying the microbiome of plant tissue collected from a fruit plant in the environment. The sequencing provider mentioned that I got about 20GB of undetermined reads, but when I did OTU analysis and BLAST, I was able to differentiate different species as well as different undetermined OTUs (I expect some undetermined OTUs from an environmental sample). I am a bit confused: are undetermined OTUs different from the reads the MiSeq put in the undetermined folder?

Thank you
Vanga
Old 03-01-2017, 09:51 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

The MiSeq does not determine anything. It is a sequencing platform; all it does is produce sequences, and it is up to the user to determine what they are.

Illumina sequencing platforms support multiplexing, in which multiple libraries are sequenced together. Each library has a different index (or bar code) that indicates which library a read came from. During demultiplexing, the reads are split into libraries based on the bar code (typically an 8 bp sequence within the adapters of the molecule being sequenced). If the bar code sequence is low quality, the read is sent to the "undetermined" bin, meaning that it is not clear which library it came from. It may be possible to BLAST the undetermined bin and decide with high confidence which organism a read came from, in situations where the multiplexed organisms are very different. But I don't recommend that, as it will increase noise. Instead, if you are getting a large volume in your undetermined bin, you should complain to Illumina (or whoever provides your adapters) about sequence wasted because of low-quality index reads, or because the indexes are too short or too close in edit distance to distinguish between libraries.
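
As a rough illustration of that binning, here is a minimal sketch that assigns an index read to a library only when exactly one expected bar code is within a mismatch limit, and otherwise sends it to "Undetermined". This is a simplified model and not the actual CASAVA or bcl2fastq algorithm; the sample names, bar codes, and function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, sample_sheet, max_mismatches=1):
    """Return the single library within max_mismatches, else 'Undetermined'."""
    hits = [
        name
        for name, barcode in sample_sheet.items()
        if len(barcode) == len(index_read)
        and hamming(barcode, index_read) <= max_mismatches
    ]
    # Ambiguous or unmatched index reads cannot be assigned safely.
    return hits[0] if len(hits) == 1 else "Undetermined"


def min_pairwise_distance(sample_sheet):
    """Smallest distance between any two bar codes in the sheet."""
    return min(hamming(a, b) for a, b in combinations(sample_sheet.values(), 2))


# Hypothetical sample sheet with two 8 bp bar codes.
sample_sheet = {"libA": "ACGTACGT", "libB": "TGCATGCA"}

one_error = assign_sample("ACGTACGA", sample_sheet)   # one mismatch to libA
no_match = assign_sample("GGGGGGGG", sample_sheet)    # far from both bar codes
closest_pair = min_pairwise_distance(sample_sheet)
print(one_error, no_match, closest_pair)
```

Tolerating one mismatch is only safe when every pair of bar codes differs in at least three positions, which is why the distance between indexes matters as much as their quality.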

Last edited by Brian Bushnell; 03-01-2017 at 10:06 PM.
Old 03-01-2017, 10:18 PM   #8
bvanga
Junior Member
 
Location: New Zealand

Join Date: Sep 2016
Posts: 2
Thank you

In this case, though about 2.0GB was sent into the undetermined bin, I still obtained about 900 OTUs with good length (350 to 380 bp) and sequence depth (about 50,000 reads per sample). BTW, what is a good sequence depth? Is there a rough figure for judging sequence depth, or is it highly variable depending on the sample?
Old 03-01-2017, 10:52 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

"50,000 reads per sample" is not a depth. A depth would be something like "300x", which would be the result of, for example, sequencing 10 million 2x150bp pairs (3Gbp) for a 10Mbp organism.

It would be helpful if you could clarify your experiment and goal. Also, I suggest you repost the question in a new thread as it is unrelated to the current thread. By that, I mean, take some time to think about the optimal phrasing of the question, and then create a new thread explaining everything you know about the situation, and what you want to accomplish.
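
For reference, the depth arithmetic in this post can be written out as a short calculation. This is just a sketch of the usual total-bases-over-genome-size estimate; the function name is made up, and the numbers are the ones from the example in the post.

```python
def coverage_depth(read_pairs, read_length, genome_size):
    """Average depth = total sequenced bases / genome size (paired-end reads)."""
    total_bases = read_pairs * 2 * read_length
    return total_bases / genome_size


# 10 million 2x150 bp pairs (3 Gbp) on a 10 Mbp genome.
depth = coverage_depth(read_pairs=10_000_000, read_length=150, genome_size=10_000_000)
print(f"{depth:.0f}x")
```

A reads-per-sample figure only becomes a depth once it is divided by the size of the target being sequenced.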
Old 03-02-2017, 04:22 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

Quote:
Originally Posted by bvanga
In this case, though about 2.0GB was sent into an undetermined bin, I still have obtained about 900 OTUs with good length (350 to 380bp), and sequence depth (about 50,000 reads per sample). BTW what is a good sequence depth? is there any rough figure to judge the sequence depth or is it highly variable based on the sample.

Using reads from the "undetermined" pool (if they ended up there after allowing for 1 or more errors in tag reads) is questionable. There are always some reads that can't be explained by the observed "tags" in multiplex sequencing. Even if you were able to obtain OTUs from them, you can't be sure which of your samples they belong to.
All times are GMT -8.

