  • Efficient way to split FASTQ files based on Illumina indexes in the ID

    I have received new NextSeq reads from our core facility in a semi-demultiplexed state. The P5 and P7 indices are placed in the sequence ID in an unsorted state. Here is an example of four reads in one of the pair's fastq files:
    Code:
    @NS500551:36:H5VJNBGXX:1:11101:17033:1044 2:N:0:GGACTCCT+GCGATCTA
    GGGAGGTCTATATAAGCAGAGCTGGTACCA............
    +
    AAAAA.FF<)<.<FFFFFFA<.FFFFFF.F.FFFFA..........
    @NS500551:36:H5VJNBGXX:1:11101:2211:1044 2:N:0:TAAGGCGA+TCTACTCT
    GGGAGGTCTATATAAGCAGAGCTATAACCTC.......
    +
    AAA<A.FFF)7.<FFF7.AFFAA)F<AA)FFFFAA.......
    @NS500551:36:H5VJNBGXX:1:11101:24462:1044 2:N:0:TCCTGAGC+GCGATCTA
    GGGAGGTCTATATAAGCAGAGCTGGTACCAC........
    +
    <AA.A.FA<.7.FFFF<)FFFFAFF<A<.<FF<FF.....
    @NS500551:36:H5VJNBGXX:1:11101:16844:1044 2:N:0:AGGCAGAA+TCTACTCT
    GGGAGGTCTATATAAGCAGAGCTATAACTTCG........
    +
    AAA<A.F.F<A<.FFFF<F)F.FAFFF<FFAFFFFFFFFF......
    I would like to split the reads into separate fastq files based on the indices, but I cannot find any suitable tools to do it. It needs to be reasonably fast as well, as this sequencing run has 400 million reads ....

    Any help is much appreciated.
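
A minimal sketch of the split described in the question, assuming the index pair is always the text after the last ':' in the header line (as in the four example reads). The function names (`read_fastq`, `index_of`, `demux_by_header`) are made up for illustration. It keeps one output file open per distinct index string, so it also assumes that number stays below the operating system's open-file limit; with many erroneous indices that assumption may not hold.

```python
import io
import os
from contextlib import ExitStack


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def index_of(header):
    """Return the index field: the text after the last ':' in the header."""
    return header.rsplit(":", 1)[-1]


def demux_by_header(handle, out_dir, suffix=".fq"):
    """Stream records to out_dir/<index><suffix>; return per-index read counts."""
    counts = {}
    with ExitStack() as stack:
        outputs = {}
        for header, seq, plus, qual in read_fastq(handle):
            idx = index_of(header)
            if idx not in outputs:
                path = os.path.join(out_dir, idx + suffix)
                outputs[idx] = stack.enter_context(open(path, "w"))
                counts[idx] = 0
            outputs[idx].write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            counts[idx] += 1
    return counts
```

Pure Python is unlikely to be fast enough to be comfortable on 400 million reads, so this is mainly useful for checking a subset or for understanding what a dedicated tool has to do.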

  • #2
    Was the facility unwilling to demultiplex these for you? That seems kind of odd (unless you chose not to provide them with index information beforehand).

    • #3
      The most flexible demultiplexing tool I am aware of is this:


      I assume that it should work. Perhaps you have to remove the "+" between the barcodes or modify the script so that it ignores the "+".
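
If removing the "+" turns out to be necessary, it could be done along these lines. This is only a sketch: `strip_plus` is a hypothetical helper, and it assumes the index pair is the last ':'-separated field of the header.

```python
def strip_plus(header):
    """Remove the '+' separating the two indices in the final header field."""
    prefix, sep, index = header.rpartition(":")
    return prefix + sep + index.replace("+", "")


def strip_plus_in_fastq(lines):
    """Apply strip_plus to every header line (every 4th line) of a FASTQ stream."""
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:
            yield strip_plus(line.rstrip("\n")) + "\n"
        else:
            # Sequence, separator and quality lines pass through unchanged,
            # so the lone '+' separator line is not touched.
            yield line
```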

      • #4
        BBTools has a program, "demuxbyname", which will do this. Usage:

        demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

        "Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

        In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.

        Oh, and it's extremely fast.
        Last edited by Brian Bushnell; 04-07-2015, 02:37 PM.
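
Since the names file has to contain the barcodes exactly as they appear in the read headers, one way to check a list before running the tool is to tally the index strings actually present in the FASTQ. The sketch below is not part of BBTools; `count_indices` and `names_file_text` are hypothetical helpers, and the `min_reads` cutoff is an arbitrary assumption for dropping rare, probably erroneous index strings. The expected barcodes should still come from the sample sheet.

```python
import io
from collections import Counter


def count_indices(handle):
    """Count index strings taken from every 4th line (the header) of a FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 == 0 and line.strip():
            counts[line.rstrip("\n").rsplit(":", 1)[-1]] += 1
    return counts


def names_file_text(counts, min_reads=1):
    """One barcode per line, most frequent first, skipping rare index strings."""
    return "".join(f"{bc}\n" for bc, n in counts.most_common() if n >= min_reads)
```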

        • #5
          BBMap is the solution!

          Thank you all for your help! The "demuxbyname.sh" approach works well and is fast.

          Could I ask you two more things, Brian? What I cannot find in this script is an option to save the unmatched reads to a separate file. Is there such an option? The reason I would like this is that the reads are from NextSeq v1 chemistry, so a significant number of reads have mismatches in the indices. I also cannot find any option for allowing 1 or 2 mismatches (the Illumina demultiplexer normally allows 1 mismatch per index, i.e., a total of two mismatches).

          As for why I need to do this: the core facility can and will demultiplex the files for me. They skipped it out of misdirected kindness following a miscommunication. It is just that I need the data really soon, and they need some time to do it.
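
To make the "1 mismatch per index" rule concrete, here is a rough sketch of that kind of matching. It is not how demuxbyname or the Illumina software is implemented; `hamming` and `assign` are hypothetical names, and it assumes the observed and expected barcodes both use the "P7+P5" form seen in the headers.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign(observed, expected, max_mismatch=1):
    """Return the single expected pair within max_mismatch per index, else None."""
    obs_parts = observed.split("+")
    hits = []
    for candidate in expected:
        cand_parts = candidate.split("+")
        if len(cand_parts) != len(obs_parts):
            continue
        if all(
            len(o) == len(c) and hamming(o, c) <= max_mismatch
            for o, c in zip(obs_parts, cand_parts)
        ):
            hits.append(candidate)
    # An ambiguous read (more than one hit) is treated as unmatched.
    return hits[0] if len(hits) == 1 else None
```

Reads for which `assign` returns `None` would be the ones to send to an "unmatched" file.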

          Thank you again!

          • #6
            Brian's programs share common options so you may want to try adding "outu=file_name" to your command to see if the unmatched reads are captured there.

            • #7
              Hmmm, actually it doesn't have an "outu" flag right now; I'll add that for the next release.

              We strictly throw away all reads with imperfect barcodes to minimize the risk of cross-contamination. But, I could add an option to the program to allow mismatches, I suppose; I might as well.

              This will require a lot of memory, but if you want to capture all of the reads that did not have matching barcodes right now, you can do so like this:

              1) Concatenate all of the output files that did have correct barcodes into a single file:
              cat out_*_1.fq > combined.fq

              2) Run filterbyname.sh:
              filterbyname.sh in=r#.fq out=nonmatching#.fq names=combined.fq include=f
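
The idea behind the two steps above, and the reason it needs a lot of memory, can be sketched as follows: every read name from the matched files is held in a set, and only reads whose name is absent from that set are kept. This is an illustration, not the filterbyname implementation; the helper names are hypothetical, and it assumes the read name is the header text before the first whitespace.

```python
import io


def read_name(header):
    """Read name: header text before the first whitespace, without the leading '@'."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def names_in(handle):
    """Collect the read names of all records in a FASTQ stream (held in memory)."""
    return {
        read_name(line)
        for line_number, line in enumerate(handle)
        if line_number % 4 == 0 and line.strip()
    }


def exclude_named(handle, names):
    """Yield 4-line records (as one string) whose read name is NOT in names."""
    record = []
    for line in handle:
        record.append(line)
        if len(record) == 4:
            if read_name(record[0]) not in names:
                yield "".join(record)
            record = []
```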

              • #8
                Demuxbyname now supports an "outu" flag. It does not support substitutions yet, though.

                • #9
                  Does demuxbyname support wildcards by chance?
                  Thanks in advance!

                  • #10
                    It does not explicitly support wildcards, but you also don't necessarily need to supply a list of exact names. For example, with standard Illumina headers that have a barcode in them (at least, in the format we generate them), you can demux into multiple files, one per barcode, without supplying a list of barcodes. Or you can match just a prefix, suffix, or substring (to a list of names) so the rest is implicitly a wildcard... in other words, you can match patterns like "foo*" or "*foo" or "*foo*", but not "foo*bar".
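
The three matching modes described above correspond to plain string operations. A small sketch, with a hypothetical `match_name` function that is not taken from BBTools:

```python
def match_name(header, names, mode="substring"):
    """Return the first name that matches the header under the given mode, else None."""
    tests = {
        "prefix": header.startswith,       # pattern "foo*"
        "suffix": header.endswith,         # pattern "*foo"
        "substring": header.__contains__,  # pattern "*foo*"
    }
    test = tests[mode]
    for name in names:
        if test(name):
            return name
    return None
```

A pattern such as "foo*bar" has no equivalent here because each mode tests a single contiguous string.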

                    • #11
                      Great tool. I was wondering, is there a way to get it to match on the first N nts of the name? My use case is the following: I have a big fastq file with unsplit indices. The indices were read as 9+9 but were in fact only indexed with 6mers. So they look like this:

                      @7001253F:517:CBKMUANXX:5:1107:5342:1998 1:N:0:GTGTGATCT+TCTTTCCCT

                      But only the first 6 nts of the suffix (i.e. GTGTGA) are actually part of the barcode.

                      All my barcodes are separated by hamming distance of 3. Ideally, I would like to separate the barcodes, allowing up to 2 mismatches in the barcode region only, ignoring mismatches in the non-barcode region, e.g.
                      GTGTGA => check for mismatches and sort read into bin
                      TCT+TCTTTCCCT => ignore mismatches in this region
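
                      In other words, something like the following sketch, where only the first 6 nt of the index field are compared. The function names are hypothetical. Note that with barcodes only 3 apart, a read with 2 mismatches can be equally close to two barcodes, so the sketch treats such ties as unmatched.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def nearest_barcode(index_field, barcodes, length=6, max_dist=2):
    """Compare only the first `length` bases; return the unique closest barcode or None."""
    observed = index_field[:length]
    if len(observed) < length:
        return None
    scored = sorted((hamming(observed, bc), bc) for bc in barcodes if len(bc) == length)
    if not scored or scored[0][0] > max_dist:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # equally close to two barcodes: ambiguous
    return scored[0][1]
```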

                      So when I run this:
                      demuxbyname.sh in=S1_R1.fastq in2=S1_R2.fastq out=out_%_#.fq prefixmode=f names=barcode_names.txt hdist=2 outu=unmatched

                      I get reads in the unmatched file where some of the mismatches that led to read exclusion are outside the barcode region.

                      I tried the argument length=6, but it did not seem to solve the problem for me. I did not quite understand the documentation for the length argument, reproduced here:

                      length=0
                      "If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names.
                      For example, you could create files based on the first 8 characters of read names."

                      How do you specify whether it is a suffix or a prefix?
                      What does "instead of or in addition to the list of names" mean?

                      I did see the argument substring, so perhaps I could use that and it would produce similar results to what I want for the most part, but it's not technically what I want, since I only want to consider matches in the first 6 nts as valid.

                      I realize there is a workaround in that I could write a script to make a duplicate fastq file where I've truncated the barcodes to 6, then run it, then recover the original by read ID, but I was wondering if there is already a built-in way to do this with your tool and I am missing something in my reading of the docs.
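
                      The truncation step of that workaround would be roughly the following sketch (hypothetical helper names; it assumes the index field is everything after the last ':' in the header). The read ID before the first space is left untouched, so the original records can still be recovered by ID afterwards.

```python
import io


def truncate_index(header, keep=6):
    """Keep only the first `keep` characters of the final ':'-separated field."""
    prefix, sep, index = header.rpartition(":")
    return prefix + sep + index[:keep]


def truncate_fastq(lines, keep=6):
    """Rewrite header lines (every 4th line) with truncated index fields."""
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:
            yield truncate_index(line.rstrip("\n"), keep) + "\n"
        else:
            yield line
```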

                      Thanks in advance!
