  • NitaC
    Member
    • Apr 2013
    • 17

    Undetermined.fastq file

    Hello,

    This may be a very elementary question but since what I have found thus far on the internet has not entirely clarified this for me, I figured I'd ask here.

    When a sequencing experiment is run on an Illumina platform, there are always *_Undetermined.fastq.gz files after demultiplexing. I am lost as to why exactly some reads end up in there and what the purpose of this file is. I've read that one may sometimes use this file to observe index frequencies or to troubleshoot other issues, but again, I am not entirely clear on this. Is this file strictly for troubleshooting (i.e. will the reads in it never be used in any downstream analysis)?

    Thanks in advance for any help on this.
  • Bukowski
    Senior Member
    • Jan 2010
    • 388

    #2
    I think it's just the reads where it has not been possible to demultiplex on the barcode with sufficient accuracy. There's always some data here even if your sample sheet is set up properly.


    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #3
      That's correct. Undetermined is also where PhiX reads are supposed to end up if it was spiked in.


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        There are some special circumstances in which I deliberately want the reads to go into the "undetermined" file (when using CASAVA or bcl2fastq to demultiplex). This preserves the tags in the read IDs. We have built a demultiplexer for QIIME that can use this undetermined file to produce sample files in the QIIME format. (Sending all reads to the "undetermined" file is achieved by including a dummy tag sequence such as YYYY in the sample sheet.)
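Because the tags stay in the read IDs, the undetermined file can be inspected for index frequencies. The following is a minimal sketch, assuming CASAVA 1.8+ style headers in which the index is the last colon-separated field after the space (for example `1:N:0:ACGTACGT`). The function names and the demo reads are made up for illustration; this is not the QIIME demultiplexer mentioned above.

```python
import gzip
import io
from collections import Counter


def index_from_header(header):
    """Return the index field of a CASAVA 1.8+ style header, or None.

    Assumed layout: "@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index"
    """
    parts = header.rstrip().split(" ")
    if len(parts) < 2:
        return None
    fields = parts[1].split(":")
    # The fourth field of the comment part carries the index sequence.
    return fields[3] if len(fields) >= 4 else None


def count_indexes(handle):
    """Tally index sequences over the header lines of a FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 == 0:  # every FASTQ record is 4 lines; header first
            index = index_from_header(line)
            if index is not None:
                counts[index] += 1
    return counts


def count_indexes_in_file(path):
    """Same tally for a gzipped FASTQ file on disk (path supplied by the caller)."""
    with gzip.open(path, "rt") as handle:
        return count_indexes(handle)


# Tiny in-memory demo with invented reads.
demo = io.StringIO(
    "@M01:1:FC:1:1101:100:200 1:N:0:ACGTACGT\nAAAA\n+\nIIII\n"
    "@M01:1:FC:1:1101:101:201 1:N:0:ACGTACGT\nCCCC\n+\nIIII\n"
    "@M01:1:FC:1:1101:102:202 1:N:0:NNNNNNNN\nGGGG\n+\nIIII\n"
)
print(count_indexes(demo).most_common())
# prints [('ACGTACGT', 2), ('NNNNNNNN', 1)]
```

A very frequent index that is absent from the sample sheet usually points to a sample sheet error, whereas a long tail of rare indexes is more consistent with sequencing errors in the index read.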


        • ScottC
          Senior Member
          • Jan 2008
          • 244

          #5
          Yes, people use it for a variety of purposes, but its actual intended purpose is simply as a catch-all for any read that wasn't assignable to a sample for any reason: poor quality, incorrect indexes specified in the sample sheet, missing index sequences (e.g. PhiX reads, which have no index), sequencing errors in the index read for that sequence, etc.


          • bvanga
            Junior Member
            • Sep 2016
            • 2

            #6
            Undetermined sequence

            Hi,

            Can anyone please explain how the MiSeq (300 bp paired-end sequencing) decides which sequences are undetermined? I am studying the microbiome of plant tissue collected from a fruit plant in the environment. The sequencing provider mentioned that I got about 20GB of undetermined reads, but when I did OTU analysis and BLAST, I was able to differentiate different species and different undetermined OTUs (it is OK for me to expect some undetermined OTUs from an environmental sample). I got a bit confused... are undetermined OTUs different from the contents of the MiSeq "undetermined" folder?

            Thank you
            Vanga


            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              The MiSeq does not determine anything. It is a sequencing platform; all it does is produce sequences, and it is up to the user to determine what they are.

              Illumina sequencing platforms support multiplexing, in which multiple libraries are sequenced together. They have different indexes (or bar codes) which indicate the library they came from. During demultiplexing, the reads are split into different libraries based on the bar code (typically, 8bp sequences within the adapters of the molecule being sequenced). If the bar code sequence is low quality, the read will be sent to the "undetermined" bin, meaning that it is not clear which library it came from. It may be possible for the user to BLAST the undetermined bin and decide with high confidence which organism it came from, in situations where the multiplexed organisms are very different. But, I don't recommend that, as it will increase noise. Instead, if you are getting a large volume in your undetermined bin, you should complain to Illumina (or whoever provides your adapters) about wasted sequence due to the low quality of the index reads, or insufficient length and edit distance of indexes to distinguish between libraries.
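The assignment step described above can be sketched as follows. This is a simplified illustration, not the actual bcl2fastq algorithm: it assumes single indexes of equal length, compares by Hamming distance, and sends a read to "Undetermined" when no library, or more than one, matches within the allowed mismatches. The function names, library names and index sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("indexes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(indexes):
    """Smallest Hamming distance between any two indexes (needs at least two)."""
    seqs = list(indexes)
    return min(
        hamming(a, b)
        for i, a in enumerate(seqs)
        for b in seqs[i + 1:]
    )


def assign_library(observed, sample_indexes, max_mismatches=1):
    """Return the library whose index matches the observed index read.

    sample_indexes maps library name -> expected index sequence.
    Reads with no match, or an ambiguous match, go to "Undetermined".
    """
    hits = [
        name
        for name, index in sample_indexes.items()
        if hamming(observed, index) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else "Undetermined"


# Hypothetical sample sheet with two well-separated 8 bp indexes.
samples = {"libA": "ACGTACGT", "libB": "TGCATGCA"}
print(min_pairwise_distance(samples.values()))   # prints 8
print(assign_library("ACGTACGA", samples))       # prints libA
print(assign_library("NNNNNNNN", samples))       # prints Undetermined
```

Allowing one mismatch is only safe when every pair of indexes differs at three or more positions; otherwise a single error can make a read match two libraries, or the wrong one, which is why short or closely related indexes waste sequence.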
              Last edited by Brian Bushnell; 03-01-2017, 10:06 PM.


              • bvanga
                Junior Member
                • Sep 2016
                • 2

                #8
                Thank you

                In this case, though about 2.0GB was sent into the undetermined bin, I still obtained about 900 OTUs with good length (350 to 380bp) and sequence depth (about 50,000 reads per sample). BTW, what is a good sequence depth? Is there any rough figure to judge the sequence depth, or is it highly variable depending on the sample?


                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  "50,000 reads per sample" is not a depth. A depth would be something like "300x", which would be the result of, for example, sequencing 10 million 2x150bp pairs (3Gbp) for a 10Mbp organism.
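The arithmetic behind that example can be written out as a short sketch. The function name and parameters are illustrative only, and it assumes every base of every pair counts toward coverage of a single genome.

```python
def average_depth(read_pairs, read_length_bp, genome_size_bp):
    """Average fold coverage for paired-end reads over one genome."""
    total_bases = read_pairs * 2 * read_length_bp  # two reads per pair
    return total_bases / genome_size_bp


# 10 million 2x150bp pairs = 3 Gbp, over a 10 Mbp genome.
print(average_depth(10_000_000, 150, 10_000_000))  # prints 300.0
```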

                  It would be helpful if you could clarify your experiment and goal. Also, I suggest you repost the question in a new thread as it is unrelated to the current thread. By that, I mean, take some time to think about the optimal phrasing of the question, and then create a new thread explaining everything you know about the situation, and what you want to accomplish.


                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    Originally posted by bvanga:
                    In this case, though about 2.0GB was sent into the undetermined bin, I still obtained about 900 OTUs with good length (350 to 380bp) and sequence depth (about 50,000 reads per sample). BTW, what is a good sequence depth? Is there any rough figure to judge the sequence depth, or is it highly variable depending on the sample?
                    Using reads from the "undetermined" pool (if they ended up there after allowing for 1 or more errors in tag reads) is questionable. There are always some reads that can't be explained by the observed "tags" in multiplex sequencing. Even if you were able to obtain OTUs from them, you can't be sure which of your samples they belong to.
