
  • bcl2fastq and index length

    Hi All,

    I'm trying to convert bcl files to fastq and preserve the index sequences in the read identifier line.
    I followed this guide, http://seqanswers.com/forums/showthread.php?t=39153, and tips from GenoMax
    got me to where I needed to be. However, like dsobral, I am also curious why 8 bp dual indexing (16 bp of I7-I5) ends up
    as a 14 bp barcode in the output (see the index at the end of the identifier line in the first example below).

    Please understand that we do not produce our own sequencing data, and the files I am working with have been obtained with
    minimal information about the processes involved. I understand that in both cases the sequencing centers uploaded data to BaseSpace
    directly following 'typical MiSeq runs'.

    (vex)[ir210@beast Sample_lane1]$ ls
    lane1_Undetermined_L001_R1_001.fastq.gz lane1_Undetermined_L001_R2_001.fastq.gz SampleSheet.csv
    (vex)[ir210@beast Sample_lane1]$ zcat lane1_Undetermined_L001_R1_001.fastq.gz |head -n4
    @MISEQ:30:000000000-AB55B:1:1101:15923:1332 1:N:0:GACCGATGATGCTG
    AGGTCTCAGTGGCATGATCATACTTCATTATAGCCTCCAACTCCCTGGGTCAAGCAATCCTTCCACCTCAGCCTTCTAAGTAGCTGGGACTACAGGCGTGCACTACCAGACACTACCTGTCTCTTATACACATCTCCGAGCCCACGAGACG
    +
    BCCBCFFFFFFFGGGGGGGGGGHHHHGHHHHHHGHHHHHHHHHHHHHHGHHHHHHHHIIHHHHHHHHHHHHHHHHHHHHHHHGHGHGGHHHHHHHHGGGGGGHHHHHFHHGGHGHHHHHHHHHHHFHHHHHHHHHHHGGGGGGHGGFGGGD


    What I don’t understand is how the MiSeq produces 'lost reads' that have correctly formatted 8 bp indexes in the identifier line.
    Here is a lost read generated automatically by the MiSeq. How did it determine the identity of the 8th position base if there are phasing issues?

    (vex)[ir210@beast SLX-7061.000000000-AA0WP]$ zcat SLX-7061.000000000-AA0WP.s_1.r_1.lostreads.fq.gz|head -n4
    @M01686:136:1:1101:15921:1413#CGACTCCT#TGTGTAGA
    TCTCAGTTCCTCTATTTTTGTTCTATCCTGCCCTATTTCTAAGTCAGATCCTACATACAAATCATCCACCTATTGATTGCTCCCTACTGTCTCTTATACACATCTCCCTCCCCACGAGACGCCCTCCTCTCTCTTCTCCCGTCTTCTTCTTCTCCACCACACTCTCTTCCCTTCCCTCTTCTTCCTTCCTCCTCTTCCCCCCCCCCCCCCCTTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTCCCCCCTCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    +
    @BBCCGFGDE@CF,CCCCE+C;;C,;,;<,C6CE,;6<C#####################################################################################################################################################################################################################################################################
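
As a quick check of the two header styles shown above, here is a minimal Python sketch that pulls the index out of each identifier line and reports its length. The function name `index_from_header` is made up for illustration, and the parsing assumes only the two layouts seen in this post (a space-separated comment ending in `:index`, and a `#index1#index2` suffix); other header formats are not handled.

```python
def index_from_header(header):
    """Return the index string embedded in a FASTQ identifier line.

    Assumption: the header uses one of the two layouts shown in this post.
    """
    header = header.strip()
    if "#" in header:
        # '#'-delimited layout from the lostreads file: name#index1#index2
        return "".join(header.split("#")[1:])
    # Space-separated layout: '@name read:filter:control:index'
    comment = header.split(" ", 1)[1]
    return comment.split(":")[3]


headers = [
    "@MISEQ:30:000000000-AB55B:1:1101:15923:1332 1:N:0:GACCGATGATGCTG",
    "@M01686:136:1:1101:15921:1413#CGACTCCT#TGTGTAGA",
]
index_lengths = [len(index_from_header(h)) for h in headers]
print(index_lengths)  # [14, 16]
```

The first header carries 14 index bases (7 + 7) and the second carries 16 (8 + 8), which is the discrepancy in question.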


    Any information regarding the automated Illumina MiSeq demuxing path compared with manual bcl2fastq processing would be
    gratefully received, especially if you can fill me in on how the 8th position base in the index is actually used.

    Thanks!
    Last edited by PopGenTech; 12-19-2014, 03:52 AM.

  • #2
    Just to clarify: was the data in example 1 above processed with bcl2fastq and example 2 with MiSeq Reporter?

    On-board demultiplexing on MiSeq is able to keep all bases on the tag reads. It also does 1-error demultiplexing by default on tag reads (and this can't be turned off).

    If the two tag reads can't be correctly assigned based on the sample sheet you are providing to bcl2fastq, they will end up in the fastq header as a concatenated string.

    With standard barcodes, most of the discrimination happens in the first 5-7 bases of the tag, so the 8th position is not that critical.
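
To illustrate what 1-error demultiplexing means in practice, here is a minimal Python sketch. It is not the actual MiSeq Reporter or bcl2fastq algorithm, just a Hamming-distance match under stated assumptions. The sample names are hypothetical, and the index sequences are borrowed from read headers elsewhere in this thread purely as example strings.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign(observed, sheet, max_mismatches=1):
    """Return the single sample whose index is within max_mismatches, else None."""
    hits = [name for name, index in sheet.items()
            if hamming(observed, index) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet (names invented; sequences are examples only).
sheet = {
    "sample_A": "CGACTCCT",
    "sample_B": "TAAGGCTC",
    "sample_C": "CTAGTCAG",
}

# One miscalled base in the last position still resolves to sample_A.
one_error = assign("CGACTCCA", sheet)

# Dropping the 8th base from both the read and the sheet still discriminates.
sheet_7bp = {name: index[:7] for name, index in sheet.items()}
truncated = assign("CGACTCC", sheet_7bp)

# An index far from every entry is left unassigned.
unassigned = assign("GGGGGGGG", sheet)

print(one_error, truncated, unassigned)  # sample_A sample_A None
```

In this toy sheet the indexes differ from each other at three or more of the first seven positions, which is why the 8th base adds little.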



    • #3
      Thanks GenoMax,

      Yes, that is correct: the top example is bcl2fastq, the bottom is MSR (MiSeq Reporter). Note that the reads are unrelated examples from different runs/libraries.

      "On-board demultiplexing on MiSeq is able to keep all bases on the tag reads." is what I'm trying to understand.

      From your previous explanation, and the fact that the last position of the index read isn't phased, I understand that there is no data to identify the 8th position base. Consequently, the only way to deduce the full 8 bp of a tag is informatically. However, if the sample sheet is purposefully fake and the sequencer software doesn't have a lookup table of indexes, how can it impute the correct tag? This is why the bcl2fastq read ID lines have a 14 bp tag, despite the 16 bp of index sequence.

      "If the two tag reads can't be correctly assigned, based on the sample sheet you are providing to bcl2fastq, they will end up in the fastq header as a concatenated string."

      Yes, as in the lostreads file, but how does the MSR manage to construct a 16 bp tag given the same fake SampleSheet.csv? What information was used to impute the identity of the 8th position base, or should it really be treated as an N?

      Is it that on-board demuxing can use sequence run information not available to bcl2fastq in post run analysis, or is it error correction / imputation using a separate algorithm?

      Thank you for your kind help and explanations of the process.



      • #4
        Calls for the last base are there. They are not being imputed. bcl2fastq ignores that call, whereas the on-board software keeps it. The instrument is going to sequence as it was set up. If you absolutely need "n" bases and are planning to use bcl2fastq, then it is better to set the run up as n+1 cycles.



        • #5
          Thanks GenoMax, that's what I needed to know.
          Kind regards.



          • #6
            Final word: thanks to CRI UK for pointing me in this direction.

            Explicitly state the bases mask with --use-bases-mask to override config.xml.
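
As a rough illustration of why the mask matters, here is a small Python sketch that counts how many index bases a mask keeps per index read. It is a simplified reading of the mask syntax, not bcl2fastq's own parser: it handles only `y`, `I` and `n` with optional counts, and ignores wildcards such as `*`. The function name `kept_index_bases` is invented, and the `I7n` mask is only an assumption about what dropping the last index cycle would look like when written out; it is not taken from any config.xml.

```python
import re


def kept_index_bases(mask):
    """Count the bases kept in each index read of a comma-separated mask.

    Simplified sketch: supports y/I/n followed by an optional count only.
    """
    kept = []
    for read_mask in mask.split(","):
        tokens = re.findall(r"([YyIiNn])(\d*)", read_mask)
        # A letter with no number stands for a single cycle.
        index_cycles = [int(count) if count else 1
                        for letter, count in tokens if letter.upper() == "I"]
        if index_cycles:
            kept.append(sum(index_cycles))
    return kept


explicit = kept_index_bases("y150n,I8,I8,y150n")
# Assumed mask: last cycle of each index read ignored.
assumed_default = kept_index_bases("y150n,I7n,I7n,y150n")

print(explicit, sum(explicit))                # [8, 8] 16
print(assumed_default, sum(assumed_default))  # [7, 7] 14
```

Under that assumption, the 14 bp versus 16 bp difference comes down to whether the final cycle of each index read is masked with `n` or kept with `I`.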

            #configureBclToFastq.pl --input-dir test_demux/Data/Intensities/BaseCalls --output-dir test_output --sample-sheet test_demux/Data/Intensities/SampleSheet.csv --use-bases-mask y150n,I8,I8,y150n --no-eamss --fastq-cluster-count 0

            This gets the full 2x8bp of the index in the output as shown:

            (vex)[ir210@beast Sample_lane1]$ zgrep '^@' lane1_Undetermined_L001_R1_001.fastq.gz |head -n10|cut -d: -f10|sort|uniq -c|sort -nr

            4 TAAGGCTCTAAGGCTC
            2 CTAGTCAGATTCCGAG
            1 TACATGAGTCTTTCCC
            1 GCCTTAGACGATTGAC
            1 GACTAGCTGGCATTGT
            1 GACCGATTGATGCTGT
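
For anyone who prefers to do the same tally in Python, here is a minimal sketch. It reads the identifier as every fourth line rather than matching on a leading `@`, because quality strings can also start with `@` (as in the lostreads example earlier in the thread). The function names are made up, and the demo records are toy data with shortened sequences; only the index strings are taken from the output above.

```python
import gzip
from collections import Counter


def count_indexes(lines, max_reads=None):
    """Tally the last ':'-separated field of each FASTQ identifier line."""
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 != 0:
            continue  # skip sequence, '+' and quality lines
        if max_reads is not None and line_number // 4 >= max_reads:
            break
        counts[line.strip().rsplit(":", 1)[-1]] += 1
    return counts


def count_indexes_in_file(path, max_reads=None):
    """Same tally for a gzipped FASTQ file on disk."""
    with gzip.open(path, "rt") as handle:
        return count_indexes(handle, max_reads)


# Toy records: note the first quality line starts with '@'.
demo = [
    "@MISEQ:30:000000000-AB55B:1:1101:15923:1332 1:N:0:TAAGGCTCTAAGGCTC",
    "ACGT", "+", "@BBC",
    "@MISEQ:30:000000000-AB55B:1:1101:15921:1413 1:N:0:CTAGTCAGATTCCGAG",
    "ACGT", "+", "GGGG",
]
demo_counts = count_indexes(demo)
print(demo_counts.most_common())
```

On a real file, something like `count_indexes_in_file("lane1_Undetermined_L001_R1_001.fastq.gz", max_reads=10)` would cover the same first ten reads as the shell pipeline.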
