Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • PeatMaster
    Junior Member
    • Mar 2016
    • 8

    Adding barcode indexes back into FASTQ headers?

    Hi,

    I have a set of fastq files from a fungal ITS2 amplicon run on a MiSeq. There are 3 corresponding fastq files: Read1, Read2 and an Index file. In the past, I have worked with fastq files where the Read 1 and Read 2 files (or when they are interleaved into one file) have the barcode indexes at the end of the fastq header lines in the actual Read 1 and Read 2 files; BBMap/BBDuk worked great for processing these files (e.g., adaptor removal, merging reads). However, in my current situation I have the barcodes in their own fastq file, and I can't seem to find a script in BBMap/BBDuk that accommodates my current situation.

    My questions are:

    Am I missing an argument or script in BBMap that will accommodate my situation?

    Or, does anyone know of a function or script outside of BBMap that will take the barcodes indexes in the index file and put them pack into the end of the header lines in the read1 and read2 files? I have found scripts that go the opposite way (e.g., Qiime's extract_barcodes.py script).

    Thanks for the assistance
  • gringer
    David Eccles (gringer)
    • May 2011
    • 845

    #2
    Why do you want to put barcodes back into sequences? They are usually stripped out of the reads when the demultiplexing is carried out.

    Comment

    • PeatMaster
      Junior Member
      • Mar 2016
      • 8

      #3
      Not back into the read, back into the header line

      Hi, my goal is not to put them back into the sequences themselves but to put them at the end of the header line for each sequence in the fastq. Let me know if this is not clear and I can cut and paste some examples.

      Thanks

      Comment

      • gringer
        David Eccles (gringer)
        • May 2011
        • 845

        #4
        Ah, right, that makes more sense. I wonder if 'paste' would help you here. Here's an example of what it can do:

        Code:
        $ paste -d '|' <(zcat ../ERR063640_1P.fastq.gz | head -n 16) <(zcat ../ERR063640_2P.fastq.gz | head -n 16)
        @ERR063640.4 HS13_6902:1:1101:1393:2125/1|@ERR063640.4 HS13_6902:1:1101:1393:2125/2
        TGGGAATCAGCCGATCGAGTGTACAAAATATAGTAGTTGGAAAACTAAGGCTGAGGAGCTATAAGCTGC|TTTAGAGAGTCCAAACTGTTGTGCCCGGTANNNNNNACCTTCTTCTCGAAAA
        +|+
        B>GGDFJEHBDLIIMJHIKBLLHIIGGKHI=ILHFAHJLMKKIDDKLGLIIFLKIFGKEFHH?@H>JLF|:DDEFH>BIIJGJKLJHKDCJHJJKLHKFB!!!!!!H@H8GKEKKGE8I-FM
        @ERR063640.5 HS13_6902:1:1101:1453:2128/1|@ERR063640.5 HS13_6902:1:1101:1453:2128/2
        TGAGGTGTGAGATGCGTGTCAGTGCAGAGCGAGAGAAAACGGCACGAAATAAGTGCCGGTGCTTCTCATGGAGACGACGATGGTTTGCGTCGCTT|TCACAGATGAGACGGTAAGCTCACAATGACCNNNTNATGTACAATTGTAAATC
        +|+
        CBC;DIJ>HFDLBIJLLJGJEJKKIMLKMJ<ILEIHJNJGDLLFJKFGKEIHLJHFMJGIHJLJFEJDKHGGHKJJLEJKBEL7I7C8HJ5FGF>|:DBEFHBHIIJJHILJHEKJKKI@KLHDKIJ!!!J!LFAJGLLKLIK8EKEF9
        @ERR063640.7 HS13_6902:1:1101:1378:2148/1|@ERR063640.7 HS13_6902:1:1101:1378:2148/2
        GAAATTTTCAAGAATTGACGGAAAACGAATTTCATTGCCGGGCATTCTAAGTGTTAAATTAGCATTGTACTCGTCGAGCTGAGTCGGATGATATGTGATA|CATATCGGCCCTATCACATATCATCCGACTCANCNNNACGAGTACAATACTAA
        +|+
        D>GGILJEJKKLJIKLLLJOLGHJIMKKMKEJHGKLDNFMLKLILKLLLGJNLKJLKKJIGCHHIMILHMGDLKLJ>KJKKHG7MIKKKJFBAGEBBAA<|:DFGFHGILJJLKILJKKKJIDIJLLHKKIOL!G!!!JKKHKEKKIKH9KLFI
        @ERR063640.8 HS13_6902:1:1101:1462:2157/1|@ERR063640.8 HS13_6902:1:1101:1462:2157/2
        ACGAATGCGTGGCGCGTCCTTCGGGGACTGATCAACGAAGACACCTTTACCTTTACCTTTTATACTAAAATTTATCTCCTGAAAGGAGAACTTGTAACA|TTTCATGGTGAGAGTTATGTGCAAAAACCGGAACTCGAACGGAATTCGAAGCAAGGTATCAAA
        +|+
        CCGHIIJEKKKLIMJL@FJJL?BKJO=KH?;HCHNCDAK=K<LHHDJLEBGFLKHEHKLKH9H@FME9KHBJL7H?L8LE8E6KEDCG?L@BG@BAB?A|9DDEFHGHIJJGHILJKKKJLMHJLLHKJNMHM>JKNOGKILIKKIEKIFLFIKJHLNJKJLH
        This could be followed up with code that reads in 4 lines, then displays the left-hand column for the four lines, appending the right-hand column for the second line if the line number is 1. Not sure if that's what you want to do, but here's an example:

        Code:
         paste -d '~' <(zcat ../ERR063640_1P.fastq.gz | head -n 16) <(zcat ../ERR063640_2P.fastq.gz | head -n 16) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){@buffer[0] .= " [barcode ".$F[1]."]"}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;'
        @ERR063640.4 HS13_6902:1:1101:1393:2125/1 [barcode TTTAGAGAGTCCAAACTGTTGTGCCCGGTANNNNNNACCTTCTTCTCGAAAA]
        TGGGAATCAGCCGATCGAGTGTACAAAATATAGTAGTTGGAAAACTAAGGCTGAGGAGCTATAAGCTGC
        +
        B>GGDFJEHBDLIIMJHIKBLLHIIGGKHI=ILHFAHJLMKKIDDKLGLIIFLKIFGKEFHH?@H>JLF
        @ERR063640.5 HS13_6902:1:1101:1453:2128/1 [barcode TCACAGATGAGACGGTAAGCTCACAATGACCNNNTNATGTACAATTGTAAATC]
        TGAGGTGTGAGATGCGTGTCAGTGCAGAGCGAGAGAAAACGGCACGAAATAAGTGCCGGTGCTTCTCATGGAGACGACGATGGTTTGCGTCGCTT
        +
        CBC;DIJ>HFDLBIJLLJGJEJKKIMLKMJ<ILEIHJNJGDLLFJKFGKEIHLJHFMJGIHJLJFEJDKHGGHKJJLEJKBEL7I7C8HJ5FGF>
        @ERR063640.7 HS13_6902:1:1101:1378:2148/1 [barcode CATATCGGCCCTATCACATATCATCCGACTCANCNNNACGAGTACAATACTAA]
        GAAATTTTCAAGAATTGACGGAAAACGAATTTCATTGCCGGGCATTCTAAGTGTTAAATTAGCATTGTACTCGTCGAGCTGAGTCGGATGATATGTGATA
        +
        D>GGILJEJKKLJIKLLLJOLGHJIMKKMKEJHGKLDNFMLKLILKLLLGJNLKJLKKJIGCHHIMILHMGDLKLJ>KJKKHG7MIKKKJFBAGEBBAA<
        @ERR063640.8 HS13_6902:1:1101:1462:2157/1 [barcode TTTCATGGTGAGAGTTATGTGCAAAAACCGGAACTCGAACGGAATTCGAAGCAAGGTATCAAA]
        ACGAATGCGTGGCGCGTCCTTCGGGGACTGATCAACGAAGACACCTTTACCTTTACCTTTTATACTAAAATTTATCTCCTGAAAGGAGAACTTGTAACA
        +
        CCGHIIJEKKKLIMJL@FJJL?BKJO=KH?;HCHNCDAK=K<LHHDJLEBGFLKHEHKLKH9H@FME9KHBJL7H?L8LE8E6KEDCG?L@BG@BAB?A
        Last edited by gringer; 03-03-2017, 11:24 AM.

        Comment

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          Slightly modifying code that @gringer supplied.

          Disclaimer: Please verify that the results look right with your data.

          Code:
          paste -d '~' <(cat R1.fq) <(cat R2.fq) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){@buffer[0] .= "$F[1]"}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;' > WithBarcode_R1.fq
          If your files look like this

          R1.fq

          Code:
          @FCID:1:1101:15473:1334 1:N:0:
          AGTGGACTAGGGGATGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGTAGTGCTTTGGAAACTGTGCAGCTCGAGTGCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACTGTAACT
          +
          AAAABFFFFFFCGGGGGGGGGGGGGGGGGGGGGHHHHHHGHHGGGHGHGGGGHHHGGGGGHHHHHHHHGGGGHHHGHHGGGGGGGGGGGGHHHHHHHGHGHHHHHHHHFHHHHHHGGGGHHHHGGGGGHHHHHHHHHHGHHHHHHFHHFHGGGGDFHHHHH.EGGGBFFGGGGGGEFFFGGGGFFGGGF-DFEFFFFFFA.-./FFFFBFFFBFFFFFFA?;/B?F@DCFEAAF-@FFBBBBFFEFFFB;
          @FCID:1:1101:15528:1336 1:N:0:
          GAATTGGACGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGGGAGCAGGCGGCAGCAAAGGTCTGTGGTGAAAGACTGAAGCTTAACTTCAGTAAGCCATAGAAACCGGGCAGCTAGAGTGCAGGAGAGGATCGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGACGATCTGGCCTGCAACTGAC
          +
          DDDDDFFFFCDCGGGGGGGGGGHGGGGGGGHHHHHHHGHHGHHHGHGGGGHHHGGGGGHHHHHHHHGGGGHHGHHGGGGHHHGGGGGGGHHHHGGHHHHHHHGHHHHHHHHHHHHGHHHGHGHHHHHHHHHHHHHHHHHHGGGGGGGHHHHHGHGHHHGGHGDHHGDFFGGGGGGGGGGFGGGFGGG9?EGFGGFFAD;EFFFFFFFFFFFFFFFDEEFFFFFFF-DE->CFFEEAFFFFFFFBFFFFF0
          R2.fq (barcodes)
          Code:
          @FCID:1:1101:15473:1334 2:N:0:
          TATTTGCGACAA
          +
          #>>>ABFFBBBBG
          @FCID:1:1101:15528:1336 2:N:0:
          GCGGGAAAAAAA
          +
          #############
          File you want

          Code:
          @FCID:1:1101:15473:1334 1:N:0:TATTTGCGACAA
          AGTGGACTAGGGGATGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGTAGTGCTTTGGAAACTGTGCAGCTCGAGTGCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACTGTAACT
          +
          AAAABFFFFFFCGGGGGGGGGGGGGGGGGGGGGHHHHHHGHHGGGHGHGGGGHHHGGGGGHHHHHHHHGGGGHHHGHHGGGGGGGGGGGGHHHHHHHGHGHHHHHHHHFHHHHHHGGGGHHHHGGGGGHHHHHHHHHHGHHHHHHFHHFHGGGGDFHHHHH.EGGGBFFGGGGGGEFFFGGGGFFGGGF-DFEFFFFFFA.-./FFFFBFFFBFFFFFFA?;/B?F@DCFEAAF-@FFBBBBFFEFFFB;
          @FCID:1:1101:15528:1336 1:N:0:GCGGGAAAAAAA
          GAATTGGACGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGGGAGCAGGCGGCAGCAAAGGTCTGTGGTGAAAGACTGAAGCTTAACTTCAGTAAGCCATAGAAACCGGGCAGCTAGAGTGCAGGAGAGGATCGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGACGATCTGGCCTGCAACTGAC
          +
          DDDDDFFFFCDCGGGGGGGGGGHGGGGGGGHHHHHHHGHHGHHHGHGGGGHHHGGGGGHHHHHHHHGGGGHHGHHGGGGHHHGGGGGGGHHHHGGHHHHHHHGHHHHHHHHHHHHGHHHGHGHHHHHHHHHHHHHHHHHHGGGGGGGHHHHHGHGHHHGGHGDHHGDFFGGGGGGGGGGFGGGFGGG9?EGFGGFFAD;EFFFFFFFFFFFFFFFDEEFFFFFFF-DE->CFFEEAFFFFFFFBFFFFF0
          If your files are compressed then use

          Code:
          paste -d '~' <(zcat R1.fq.gz) <(zcat R2.fq.gz) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){@buffer[0] .= "$F[1]"}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;' | gzip - > WithBarcode_R1.fq.gz
          Last edited by GenoMax; 03-03-2017, 12:12 PM.

          Comment

          • PeatMaster
            Junior Member
            • Mar 2016
            • 8

            #6
            Thanks so much to the two of you. This looks to be exactly what I want. I will be sure to verify the output. This will open up a whole new swath of tools for me to use.

            Thanks

            Comment

            • Luckyboy7
              Junior Member
              • Dec 2019
              • 1

              #7
              Hi,

              I am almost in the same boat. I have two fastQ file for each index (index i7 and i5) and two pair-end raw reads file (R1 and R2). Last time when I had two inline barcodes i7+i5 on the header, I used bbMap conveniently and demultiplexed them. But, I am not sure how can I add these two barcode indexes and bring them as inline header in the raw sequence files so that I can demultiplex my data as before. Is there anyway I can use bbmap to demultiplex my seq data whether my barcodes are in separate files?

              Comment

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                @Luckyboy7: Use deML package to demux your data. More info here.

                Comment

                Latest Articles

                Collapse

                • SEQadmin2
                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                  by SEQadmin2


                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                  ...
                  06-02-2026, 10:05 AM
                • SEQadmin2
                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                  by SEQadmin2


                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                  Introduction

                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                  05-22-2026, 06:42 AM
                • SEQadmin2
                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                  by SEQadmin2

                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                  05-06-2026, 09:04 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 06-02-2026, 12:03 PM
                0 responses
                19 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 11:40 AM
                0 responses
                14 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 05-28-2026, 11:40 AM
                0 responses
                29 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 05-26-2026, 10:12 AM
                0 responses
                31 views
                0 reactions
                Last Post SEQadmin2  
                Working...