SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Adding barcode indexes back into FASTQ headers? (http://seqanswers.com/forums/showthread.php?t=74570)

PeatMaster 03-03-2017 06:58 AM

Adding barcode indexes back into FASTQ headers?
 
Hi,

I have a set of fastq files from a fungal ITS2 amplicon run on a MiSeq. There are 3 corresponding fastq files: Read1, Read2 and an Index file. In the past, I have worked with fastq files where the Read 1 and Read 2 files (or when they are interleaved into one file) have the barcode indexes at the end of the fastq header lines in the actual Read 1 and Read 2 files; BBMap/BBDuk worked great for processing these files (e.g., adaptor removal, merging reads). However, in my current situation I have the barcodes in their own fastq file, and I can't seem to find a script in BBMap/BBDuk that accommodates my current situation.

My questions are:

Am I missing an argument or script in BBMap that will accommodate my situation?

Or, does anyone know of a function or script outside of BBMap that will take the barcodes indexes in the index file and put them pack into the end of the header lines in the read1 and read2 files? I have found scripts that go the opposite way (e.g., Qiime's extract_barcodes.py script).

Thanks for the assistance

gringer 03-03-2017 09:48 AM

Why do you want to put barcodes back into sequences? They are usually stripped out of the reads when the demultiplexing is carried out.

PeatMaster 03-03-2017 09:55 AM

Not back into the read, back into the header line
 
Hi, my goal is not to put them back into the sequences themselves but to put them at the end of the header line for each sequence in the fastq. Let me know if this is not clear and I can cut and paste some examples.

Thanks

gringer 03-03-2017 10:03 AM

Ah, right, that makes more sense. I wonder if 'paste' would help you here. Here's an example of what it can do:

Code:

$ paste -d '|' <(zcat ../ERR063640_1P.fastq.gz | head -n 16) <(zcat ../ERR063640_2P.fastq.gz | head -n 16)
@ERR063640.4 HS13_6902:1:1101:1393:2125/1|@ERR063640.4 HS13_6902:1:1101:1393:2125/2
TGGGAATCAGCCGATCGAGTGTACAAAATATAGTAGTTGGAAAACTAAGGCTGAGGAGCTATAAGCTGC|TTTAGAGAGTCCAAACTGTTGTGCCCGGTANNNNNNACCTTCTTCTCGAAAA
+|+
B>GGDFJEHBDLIIMJHIKBLLHIIGGKHI=ILHFAHJLMKKIDDKLGLIIFLKIFGKEFHH?@H>JLF|:DDEFH>BIIJGJKLJHKDCJHJJKLHKFB!!!!!!H@H8GKEKKGE8I-FM
@ERR063640.5 HS13_6902:1:1101:1453:2128/1|@ERR063640.5 HS13_6902:1:1101:1453:2128/2
TGAGGTGTGAGATGCGTGTCAGTGCAGAGCGAGAGAAAACGGCACGAAATAAGTGCCGGTGCTTCTCATGGAGACGACGATGGTTTGCGTCGCTT|TCACAGATGAGACGGTAAGCTCACAATGACCNNNTNATGTACAATTGTAAATC
+|+
CBC;DIJ>HFDLBIJLLJGJEJKKIMLKMJ<ILEIHJNJGDLLFJKFGKEIHLJHFMJGIHJLJFEJDKHGGHKJJLEJKBEL7I7C8HJ5FGF>|:DBEFHBHIIJJHILJHEKJKKI@KLHDKIJ!!!J!LFAJGLLKLIK8EKEF9
@ERR063640.7 HS13_6902:1:1101:1378:2148/1|@ERR063640.7 HS13_6902:1:1101:1378:2148/2
GAAATTTTCAAGAATTGACGGAAAACGAATTTCATTGCCGGGCATTCTAAGTGTTAAATTAGCATTGTACTCGTCGAGCTGAGTCGGATGATATGTGATA|CATATCGGCCCTATCACATATCATCCGACTCANCNNNACGAGTACAATACTAA
+|+
D>GGILJEJKKLJIKLLLJOLGHJIMKKMKEJHGKLDNFMLKLILKLLLGJNLKJLKKJIGCHHIMILHMGDLKLJ>KJKKHG7MIKKKJFBAGEBBAA<|:DFGFHGILJJLKILJKKKJIDIJLLHKKIOL!G!!!JKKHKEKKIKH9KLFI
@ERR063640.8 HS13_6902:1:1101:1462:2157/1|@ERR063640.8 HS13_6902:1:1101:1462:2157/2
ACGAATGCGTGGCGCGTCCTTCGGGGACTGATCAACGAAGACACCTTTACCTTTACCTTTTATACTAAAATTTATCTCCTGAAAGGAGAACTTGTAACA|TTTCATGGTGAGAGTTATGTGCAAAAACCGGAACTCGAACGGAATTCGAAGCAAGGTATCAAA
+|+
CCGHIIJEKKKLIMJL@FJJL?BKJO=KH?;HCHNCDAK=K<LHHDJLEBGFLKHEHKLKH9H@FME9KHBJL7H?L8LE8E6KEDCG?L@BG@BAB?A|9DDEFHGHIJJGHILJKKKJLMHJLLHKJNMHM>JKNOGKILIKKIEKIFLFIKJHLNJKJLH

This could be followed up with code that reads in 4 lines, then displays the left-hand column for the four lines, appending the right-hand column for the second line if the line number is 1. Not sure if that's what you want to do, but here's an example:

Code:

paste -d '~' <(zcat ../ERR063640_1P.fastq.gz | head -n 16) <(zcat ../ERR063640_2P.fastq.gz | head -n 16) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){@buffer[0] .= " [barcode ".$F[1]."]"}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;'
@ERR063640.4 HS13_6902:1:1101:1393:2125/1 [barcode TTTAGAGAGTCCAAACTGTTGTGCCCGGTANNNNNNACCTTCTTCTCGAAAA]
TGGGAATCAGCCGATCGAGTGTACAAAATATAGTAGTTGGAAAACTAAGGCTGAGGAGCTATAAGCTGC
+
B>GGDFJEHBDLIIMJHIKBLLHIIGGKHI=ILHFAHJLMKKIDDKLGLIIFLKIFGKEFHH?@H>JLF
@ERR063640.5 HS13_6902:1:1101:1453:2128/1 [barcode TCACAGATGAGACGGTAAGCTCACAATGACCNNNTNATGTACAATTGTAAATC]
TGAGGTGTGAGATGCGTGTCAGTGCAGAGCGAGAGAAAACGGCACGAAATAAGTGCCGGTGCTTCTCATGGAGACGACGATGGTTTGCGTCGCTT
+
CBC;DIJ>HFDLBIJLLJGJEJKKIMLKMJ<ILEIHJNJGDLLFJKFGKEIHLJHFMJGIHJLJFEJDKHGGHKJJLEJKBEL7I7C8HJ5FGF>
@ERR063640.7 HS13_6902:1:1101:1378:2148/1 [barcode CATATCGGCCCTATCACATATCATCCGACTCANCNNNACGAGTACAATACTAA]
GAAATTTTCAAGAATTGACGGAAAACGAATTTCATTGCCGGGCATTCTAAGTGTTAAATTAGCATTGTACTCGTCGAGCTGAGTCGGATGATATGTGATA
+
D>GGILJEJKKLJIKLLLJOLGHJIMKKMKEJHGKLDNFMLKLILKLLLGJNLKJLKKJIGCHHIMILHMGDLKLJ>KJKKHG7MIKKKJFBAGEBBAA<
@ERR063640.8 HS13_6902:1:1101:1462:2157/1 [barcode TTTCATGGTGAGAGTTATGTGCAAAAACCGGAACTCGAACGGAATTCGAAGCAAGGTATCAAA]
ACGAATGCGTGGCGCGTCCTTCGGGGACTGATCAACGAAGACACCTTTACCTTTACCTTTTATACTAAAATTTATCTCCTGAAAGGAGAACTTGTAACA
+
CCGHIIJEKKKLIMJL@FJJL?BKJO=KH?;HCHNCDAK=K<LHHDJLEBGFLKHEHKLKH9H@FME9KHBJL7H?L8LE8E6KEDCG?L@BG@BAB?A


GenoMax 03-03-2017 10:47 AM

Slightly modifying code that @gringer supplied.

Disclaimer: Please verify that the results look right with your data.

Code:

paste -d '~' <(cat R1.fq) <(cat R2.fq) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){@buffer[0] .= "$F[1]"}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;' > WithBarcode_R1.fq
If your files look like this

R1.fq

Code:

@FCID:1:1101:15473:1334 1:N:0:
AGTGGACTAGGGGATGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGTAGTGCTTTGGAAACTGTGCAGCTCGAGTGCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACTGTAACT
+
AAAABFFFFFFCGGGGGGGGGGGGGGGGGGGGGHHHHHHGHHGGGHGHGGGGHHHGGGGGHHHHHHHHGGGGHHHGHHGGGGGGGGGGGGHHHHHHHGHGHHHHHHHHFHHHHHHGGGGHHHHGGGGGHHHHHHHHHHGHHHHHHFHHFHGGGGDFHHHHH.EGGGBFFGGGGGGEFFFGGGGFFGGGF-DFEFFFFFFA.-./FFFFBFFFBFFFFFFA?;/B?F@DCFEAAF-@FFBBBBFFEFFFB;
@FCID:1:1101:15528:1336 1:N:0:
GAATTGGACGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGGGAGCAGGCGGCAGCAAAGGTCTGTGGTGAAAGACTGAAGCTTAACTTCAGTAAGCCATAGAAACCGGGCAGCTAGAGTGCAGGAGAGGATCGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGACGATCTGGCCTGCAACTGAC
+
DDDDDFFFFCDCGGGGGGGGGGHGGGGGGGHHHHHHHGHHGHHHGHGGGGHHHGGGGGHHHHHHHHGGGGHHGHHGGGGHHHGGGGGGGHHHHGGHHHHHHHGHHHHHHHHHHHHGHHHGHGHHHHHHHHHHHHHHHHHHGGGGGGGHHHHHGHGHHHGGHGDHHGDFFGGGGGGGGGGFGGGFGGG9?EGFGGFFAD;EFFFFFFFFFFFFFFFDEEFFFFFFF-DE->CFFEEAFFFFFFFBFFFFF0

R2.fq (barcodes)
Code:

@FCID:1:1101:15473:1334 2:N:0:
TATTTGCGACAA
+
#>>>ABFFBBBBG
@FCID:1:1101:15528:1336 2:N:0:
GCGGGAAAAAAA
+
#############

File you want

Code:

@FCID:1:1101:15473:1334 1:N:0:TATTTGCGACAA
AGTGGACTAGGGGATGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGTAGTGCTTTGGAAACTGTGCAGCTCGAGTGCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACTGTAACT
+
AAAABFFFFFFCGGGGGGGGGGGGGGGGGGGGGHHHHHHGHHGGGHGHGGGGHHHGGGGGHHHHHHHHGGGGHHHGHHGGGGGGGGGGGGHHHHHHHGHGHHHHHHHHFHHHHHHGGGGHHHHGGGGGHHHHHHHHHHGHHHHHHFHHFHGGGGDFHHHHH.EGGGBFFGGGGGGEFFFGGGGFFGGGF-DFEFFFFFFA.-./FFFFBFFFBFFFFFFA?;/B?F@DCFEAAF-@FFBBBBFFEFFFB;
@FCID:1:1101:15528:1336 1:N:0:GCGGGAAAAAAA
GAATTGGACGAGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGGGAGCAGGCGGCAGCAAAGGTCTGTGGTGAAAGACTGAAGCTTAACTTCAGTAAGCCATAGAAACCGGGCAGCTAGAGTGCAGGAGAGGATCGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGACGATCTGGCCTGCAACTGAC
+
DDDDDFFFFCDCGGGGGGGGGGHGGGGGGGHHHHHHHGHHGHHHGHGGGGHHHGGGGGHHHHHHHHGGGGHHGHHGGGGHHHGGGGGGGHHHHGGHHHHHHHGHHHHHHHHHHHHGHHHGHGHHHHHHHHHHHHHHHHHHGGGGGGGHHHHHGHGHHHGGHGDHHGDFFGGGGGGGGGGFGGGFGGG9?EGFGGFFAD;EFFFFFFFFFFFFFFFDEEFFFFFFF-DE->CFFEEAFFFFFFFBFFFFF0

If your files are compressed then use

Code:

paste -d '~' <(zcat R1.fq.gz) <(zcat R2.fq.gz) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){@buffer[0] .= "$F[1]"}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;' | gzip - > WithBarcode_R1.fq.gz

PeatMaster 03-17-2017 10:40 AM

Thanks so much to the two of you. This looks to be exactly what I want. I will be sure to verify the output. This will open up a whole new swath of tools for me to use.

Thanks

Luckyboy7 12-20-2019 06:56 AM

Hi,

I am almost in the same boat. I have two fastQ file for each index (index i7 and i5) and two pair-end raw reads file (R1 and R2). Last time when I had two inline barcodes i7+i5 on the header, I used bbMap conveniently and demultiplexed them. But, I am not sure how can I add these two barcode indexes and bring them as inline header in the raw sequence files so that I can demultiplex my data as before. Is there anyway I can use bbmap to demultiplex my seq data whether my barcodes are in separate files?

GenoMax 12-20-2019 08:11 AM

@Luckyboy7: Use deML package to demux your data. More info here.


All times are GMT -8. The time now is 01:12 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.