
  • Getting a list of all index sequences

    Hi!

    For an internal project, we are trying to get the actual index sequence for every read (all reads, whether they wind up in the undetermined indices bin or not).

    We are using CASAVA 1.8.2. From what I can tell from the CASAVA User Guide, this information is only present in the binary .bcl files.

    Does anyone know of another way to retrieve this information, other than writing a script that parses the binary files?

    Thanks for reading!

  • #2
    Mouth,

    The actual barcode read for each read is recorded in the definition line (defline) of the FASTQ file. Here is the defline format of the Illumina FASTQ files produced by CASAVA 1.8.x:

    Code:
    @HWI-ST957:100:D0V52ACXX:6:1101:1221:2161 1:N:0:CGATGT
    The barcode sequence at the end of the defline is the actual barcode read for that cluster.

    Here is an example showing that you can see variation in the barcode recorded in the defline. It grabs the first 1000 deflines from a gzipped FASTQ file, splits each defline at ":" and takes the 10th field (the barcode), then sorts the barcodes and counts the occurrences of each unique one.

    Code:
    zgrep ^@HWI CTRL1_CGATGT_L006_R1_001.fastq.gz | head -1000 | cut -d":" -f10 | sort | uniq -c
          1 CGATAT
          2 CGATGA
          1 CGATGG
        987 CGATGT
          2 CGCTGT
          1 CGGTGT
          6 TGATGT
    You can see that 98.7% (987 of 1000) match the expected barcode, and a few have a single mismatch.

    But be aware that if CASAVA demultiplexing was run with default settings, no mismatches are allowed in the barcode. You will only see differences between the barcode that was read and the configured barcode if the CASAVA run was set up (configureBclToFastq.pl) with --mismatches=1.
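A slightly more defensive variant of the one-liner above, offered only as a sketch: instead of grepping for "^@HWI" (quality lines can also start with "@", and instrument names vary), it takes line 1 of every 4-line FASTQ record and prints the last ":"-separated field. The function name count_barcodes is made up for illustration, and the sketch assumes unwrapped 4-line records with CASAVA 1.8-style deflines.

```shell
# Tally the index sequences recorded in the deflines of a gzipped FASTQ file.
# count_barcodes is a hypothetical helper name, not part of CASAVA.
# Assumes 4-line FASTQ records and CASAVA 1.8-style deflines, where the
# barcode is the last ":"-separated field.
count_barcodes() {
    # $1: path to a .fastq.gz file
    gzip -dc "$1" |
        awk -F: 'NR % 4 == 1 { print $NF }' |
        sort | uniq -c | sort -rn
}
```

It would be called as `count_barcodes CTRL1_CGATGT_L006_R1_001.fastq.gz`, and it prints one line per distinct barcode with its count, most frequent first.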



    • #3
      Hi, thanks for the reply! I'm aware that the barcode sequence can be pulled from the FASTQ files, but I was not aware that it shows the actual sequence for barcodes that have 1 mismatch. Thanks for that.

      But I also want to see the barcodes for which mismatches are 2 and greater. Are those recorded somewhere other than the .bcl files?



      • #4
        Originally posted by Mouth_Breather View Post
        But I also want to see the barcodes for which mismatches are 2 and greater. Are those recorded somewhere other than the .bcl files?
        Those are the reads in the FASTQ files under the Undetermined_indices/Sample_laneX directories (one directory per lane).
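As a rough sketch of how one might tally those barcodes, the function below reads a gzipped FASTQ file, compares the barcode in each defline with an expected barcode, and counts the ones that differ at 2 or more positions. The name barcode_mismatches is hypothetical, the file you pass in is whichever undetermined-indices FASTQ your run produced, and the comparison assumes the recorded and expected barcodes have the same length.

```shell
# Count barcodes that differ from an expected barcode at 2 or more positions.
# barcode_mismatches is a hypothetical helper, not a CASAVA tool.
# $1: gzipped FASTQ file, $2: expected barcode (e.g. CGATGT)
barcode_mismatches() {
    gzip -dc "$1" |
        awk -F: -v want="$2" '
            NR % 4 == 1 {
                got = $NF
                d = 0
                # Position-by-position comparison (Hamming distance).
                for (i = 1; i <= length(want); i++)
                    if (substr(got, i, 1) != substr(want, i, 1)) d++
                if (d >= 2) print d, got
            }' |
        sort | uniq -c | sort -rn
}
```

Each output line has three columns: the number of reads, the number of mismatches, and the barcode that was read.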
