Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • demultiplexing dual index Illumina reads with mismatch tolerance

    I'm analysing genotyping-by-sequencing data for 2,100 dual-indexed samples and am looking for a fast, mismatch-tolerant demultiplexer.

    Libraries for each sample were prepared using a unique combination of Illumina I5 and I7 index sequences. Libraries were pooled and sequenced using two Hiseq lanes.

    I’ve used the Java demuxbyname demultiplexer from the BBMap package, and this demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.

    Pairwise Hamming distances for the I5 sequences (96 in all, one for each well position) vary between 3 and 6, so I'm hoping there shouldn't be a problem tolerating one mismatch there.

    The I7 index sequences have pairwise Hamming distances between 2 and 6, so I would need to correct mismatching index reads only where they were one mismatch away from a single genuine index sequence and to reject them otherwise.

    Would anyone know of a fast demultiplexer that could offer some or all of this functionality?

    Thanks,

    Stephen

  • #2
    Illumina's bcl2fastq2 would probably be the fastest demultiplexer that's mismatch-tolerant (--barcode-mismatches flag, accepts 0-2, default=1). N.B.-this assumes you have access to the full run directory.

    Comment


    • #3
      Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.

      Comment


      • #4
        Originally posted by spark View Post
        Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.
        That would be the easiest way.

        You could enumerate all tags present in your files and then go back and use the demultiplexer from BBMap to get every read in the original file.

        Comment


        • #5
          Originally posted by spark View Post
          I’ve used the Java demuxbyname demultiplexer from the BBMap package, and this demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.
          Demuxbyname is fairly efficient, but mostly singlethreaded aside from compression, so 32 cores won't help (it reaches its cap at ~3 cores, unless the output files are gzipped and pigz is available). 2 hours sounds kind of long, though - the reason it is singlethreaded is because there is so little work to do, that adding threads won't help. At 2 hours, I wonder if the filesystem is limiting. You can determine (roughly) that by running top to see the CPU utilization of a process; if a singlethreaded process is under 100%, it's constrained by external things.

          I designed it to reprocess Illumina's demultiplexing output, because Illumina software is very slow, and I needed to do multiple tests with a quick turnaround. I have found that one can write their own program faster than an Illumina program can generate output.

          The Illumina software was configured to allow mismatches in barcodes. However, our experiments indicated a substantial cross-contamination rate in multiplexed libraries (which is extremely important in some experiments, such as single-cell). Allowing zero-mismatch barcodes improved the cross-contamination rate somewhat.

          Essentially -
          Do whatever you think is best for your experiment. But, those 30% of reads that were getting discarded... I highly recommend you discard them. They are being discarded for a reason. You can discard them with Illumina's software (which is the most efficient approach), or later with BBMap or whatever. Or you can keep them. But I cannot imagine an experiment that would benefit from importing an additional 30% of unreliable data into a dataset of reliable data.

          Comment


          • #6
            Thanks for the insights & advice - all very helpful.

            We're doing a pilot study comparing sequence reads from a multiplexed PCR of 200 SNP assays with genotypes for the same SNPs generated using an alternative platform, so we're interested in exploring all aspects of the data for now, but I shall bear in mind that the reads with mismatching index sequences may give false results.

            We've 218M reads from two lanes of a Hi Seq flowcell and 2,112 combinations of I5 and I7 adapters (22 I7 x 96 I5) to demultiplex. I'm not quite sure how long the demultiplexing took using your demuxbyname utility, though it ran in a short enough time to be practical, and is very convenient to use from a scripting point of view. One thing I did note is that demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.

            Code used:
            demuxbyname.sh in=L001_R1_001.fastq.gz out=L1.%.fq.gz outu=L1.bad_index.fq.gz suffix names=bbm_indices ow=true

            ... where bbm_indices contains the list of 2112 I5 x I7 combinations

            Comment


            • #7
              Originally posted by spark View Post
              One thing I did note is that demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.
              That's interesting. Thanks for sharing! I'm not sure I'll be able to fix it, but I will look into it... were all of the output files complete and formatted correctly? It sounds to me like the program was waiting for some output stream to finish, which never happened, so I suggest verifying that the number of reads in all of the output files actually add up to the number reported. Since your output was gzipped, if anything didn't complete, the output file would be invalid. So -

              cat L1.*.fq.gz | reformat.sh in=stdin.fq.gz pigz=f gzip=f unpigz=f gunzip=f

              That will read the files, count the number of reads, and crash with some kind of error message if any of the gzipped files are corrupt (the pigz/gzip flags tell it to not spawn additional subprocesses for compression/decompression). I've never used demuxbyname with more than ~100 output files. Is pigz installed? You can check with the command "pigz"; if it is not installed, you'll get some variety of "command not found" message. For some reason I set demuxbyname to use pigz by default, which was a bad idea as in cases like this it will try to spawn 2,112 pigz processes (if it is installed) which could cause problems.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              8 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              8 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X