**GenoMax** · 01-25-2016, 09:15 AM

I suppose you could do two things. Remove the 1st base (if it is always N, which is kind of odd, see below) from all reads and remove the first C from your barcode file.

Hypothesis: the first base is called as N because every sequence in this case actually starts with C (followed by GT). I am surprised that base calling worked from the 2nd base onwards. Having low nucleotide diversity like this is not recommended.

**smitra** · 01-25-2016, 09:25 AM

Thanks, but how can I do that? Is there a quick way? Would you suggest using fastx_trimmer?

**GenoMax** · 01-25-2016, 09:32 AM

Can you try replacing the first C with an N in your barcode file and see if fastx_barcode_splitter.pl would accept that as a valid pattern and do the demultiplexing?

If that does not work, you could use bbduk from the BBMap suite (with the forcetrimleft=1 option) or Trimmomatic's HEADCROP:1 option to remove that first base (which is N) from all reads.
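For a quick check on a small test file, the same trim can also be sketched in a few lines of Python. This is only a minimal sketch, assuming strict four-line FASTQ records; the function name `trim_first_base` and the sample record are made up for illustration and are not part of any of the tools mentioned above.

```python
import io


def trim_first_base(fastq_in, fastq_out, n=1):
    """Drop the first n bases (and their quality scores) from every record.

    Assumes strict 4-line FASTQ records: header, sequence, '+', quality.
    """
    while True:
        header = fastq_in.readline()
        if not header:
            break  # end of input
        seq = fastq_in.readline().rstrip("\r\n")
        plus = fastq_in.readline()
        qual = fastq_in.readline().rstrip("\r\n")
        fastq_out.write(header)
        fastq_out.write(seq[n:] + "\n")   # sequence without the first n bases
        fastq_out.write(plus)
        fastq_out.write(qual[n:] + "\n")  # keep quality string the same length


# Tiny made-up record, shaped like the reads in this thread.
example = "@read1\nNGTACCCAAGG\n+\n#8ABCFGGGGG\n"
trimmed = io.StringIO()
trim_first_base(io.StringIO(example), trimmed)
print(trimmed.getvalue())
```

For real data, pass open file handles instead of the `io.StringIO` objects. The sequence and quality lines must be trimmed together, otherwise downstream tools will reject the record.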

**smitra** · 01-25-2016, 09:37 AM

First I removed the leading N from all reads, as you suggested earlier, and also removed the leading C from the barcode file.

So now the reads look like this:

Code:

 mitras$ less test_out_R1.fastq 

@M02542:124:000000000-AKFBJ:1:1101:13841:1000 1:N:0:5
GTACCCAAGGGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACANNCNNGTCGAACGGTAGCNCAGAGAGCTTGCTCTNGGNTGACGAGTGGCGGACGGGNGANTAATGTCTGGGAAACTGCCCGATGGAGGGGGATANCTACTGGANANNGNNGCTAATACCGCATAACGNCGCAAGACCAAAGAGGGNGANNTCAGGGCCTCTTGNCATCGGATGNNCCCAGATGGGATNGGCTTGTAGGTGAGGTAAGNGCTCACGCNGGCGACGATCCCTAGCTTGGNNGNGAGG
+
8ABCFGGGGGGGGEEGGGGGGGG<FGGGFFGFGFGFGGEG@FGEEGGCFGGGGG?##:##6:CFFGGGDG<CG#:CCFFGEGGGGFAFG#:<#:BBFF7FFGDGGGGGGGD#8+#+:BFGGGGGGGCFFGDGG<FGGGECCGDEGGGF@#611:D,>>#6##6##66<1CF@7FFFGEGF7E#41=8=EGFFG7*?CF>>#22##2*2;@;8C8CFC<#/2AC=E*:5##/2:CFCG+8**+#*1*1552<+*+0+8D6D4+#1**)**)*#*15/*//7>5:5<.*,*)0)##1#..73
@M02542:124:000000000-AKFBJ:1:1101:12174:1002 1:N:0:5
GTAACCAAGGGTTTGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACANNCNNGTCGAGGGGCAGCATTTCAGTTTGCTTGCNAANTGGAGATGGCGACCGGCGNACNGGTGAGTAACACGTATCCAACCTGCCGATAACTCNGGGATAGCNTNNCNNAAGAAAGATTGATACCCNATGGTATAATCAGACCGNATGGTCTTATTATTAAANAATTTCGGTNNTCGATGGGGATGNGTTCCATTAGGCAGTTGGTGTGTTAATGNCGCACCAAACCTTCCTGTGANNGNGTTT
+
8ACCGGGGGGGCFGGGGGGGGGGGGGGGGGGGFGGGGDGGGGGGGGGFGGGGGGG##:##6:CFGFDEGGGGDGGGFGGGFGGGGGGGG#:C#66=,CFFFGGGG@FGEE7#++#:BBFFGGGFCFGGGGGGCGDGGGFGGGGGGGGC=#8@<<<FGG#5##8##86DCF<FCCC:BFCFFF#6>F>FGG92;@CFFGF@#116*=CF<CG?@CFFFG#3;5375:CG##212**<5C5/::#11:91A>+<>C6CE<FC:*****0:FB<#1*)//75<F30762*-2)**##1#0)0.

And the barcode file:

#SampleID BarcodeSequence
AP1E GTAACCA
AP25E GTACCCA
AP5D GTAAGAA
AP8C GTAGATA
P29F GTAGGCT
P30N GTATTCA
P31B GTCAAGA
P35C GTATTTC
V2A GTCCAGG
V3J GTCACAG

I am still having the same problem:

N85567:testdata mitras$ cat test_out_R1.fastq | fastx_barcode_splitter.pl --bcfile mapping2_bcfile_new.txt --prefix /Volumes/Cristina/Mr.DNA_2016/fastq_files/testdata/ --bol --mismatches 0
Barcode Count Location
unmatched 1000 /Volumes/Cristina/Mr.DNA_2016/fastq_files/testdata/unmatched
total 1000

**smitra** · 01-25-2016, 09:42 AM

The same thing also happens if I replace the first C with an N in the barcode file:

#SampleID BarcodeSequence
AP1E NGTAACCA
AP25E NGTACCCA
AP5D NGTAAGAA
AP8C NGTAGATA
P29F NGTAGGCT
P30N NGTATTCA
P31B NGTCAAGA
P35C NGTATTTC
V2A NGTCCAGG
V3J NGTCACAG

N85567:testdata mitras$ cat test_out_R1.fastq | fastx_barcode_splitter.pl --bcfile mapping2_bcfile_new.txt --prefix /Volumes/Cristina/Mr.DNA_2016/fastq_files/testdata/ --bol --mismatches 0
Barcode Count Location
unmatched 1000 /Volumes/Cristina/Mr.DNA_2016/fastq_files/testdata/unmatched
total 1000

**GenoMax** · 01-25-2016, 09:45 AM

Is your barcode file tab separated format? Hard to tell from the post.

**smitra** · 01-25-2016, 09:52 AM

Yes, it is tab-separated. I just copied and pasted it here, which is why it looks like that.

**GenoMax** · 01-25-2016, 09:56 AM

Originally posted by smitra:

The same thing also happens if I replace the first C with an N in the barcode file

The N idea does not work but otherwise I am able to split your example reads into two files. What version of fastx_toolkit are you using?

**GenoMax** · 01-25-2016, 09:59 AM

Barcode file with 2 sequences you posted earlier.

Code:

#SampleID       BarcodeSequence
tse1    GTACCCAA
tse2    GTAACCAA

Command I used

Code:

$ cat test.fq | fastx_barcode_splitter.pl --bcfile bar --prefix /path_to/ --bol --mismatches 0
Barcode Count   Location
tse1    1       /path_to/tse1
tse2    1       /path_to/tse2
unmatched       0       /path_to/unmatched

**smitra** · 01-25-2016, 10:02 AM

Hmm, looks good. Maybe I will re-create the barcode file.
My version is fastx_toolkit-0.0.14.
Thanks for helping.

**GenoMax** · 01-25-2016, 10:05 AM

If you made the barcode file on a PC/Mac it may contain some additional (invisible) characters. Use the dos2unix utility (or `dos2unix -c mac` / mac2unix for old-style Mac line endings) to remove those characters, or just create the file on the server.
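One way to see such invisible characters before re-creating the file is to inspect the raw bytes. The following is a minimal sketch, not part of fastx_toolkit: the function name `check_barcode_file` is hypothetical, and the checks (byte-order mark, carriage returns, exactly two tab-separated fields per line) simply follow the tab-separated layout assumed in this thread.

```python
def check_barcode_file(raw: bytes):
    """Return a list of human-readable problems found in a barcode file."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte-order mark")
    if b"\r" in raw:
        problems.append("carriage returns found (DOS or old-Mac line endings)")
    # Normalise line endings so the per-line checks below still run.
    lines = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n").split(b"\n")
    for lineno, line in enumerate(lines, start=1):
        if not line.strip() or line.lstrip().startswith(b"#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(b"\t")
        if len(fields) != 2:
            problems.append(
                f"line {lineno}: expected 2 tab-separated fields, got {len(fields)}"
            )
        elif fields[0] != fields[0].strip() or fields[1] != fields[1].strip():
            problems.append(f"line {lineno}: stray whitespace around a field")
    return problems


# Made-up examples: one clean file, one with CRLF endings, a space instead
# of a tab, and a trailing space after a barcode.
good = b"#SampleID\tBarcodeSequence\nAP1E\tGTAACCA\nAP25E\tGTACCCA\n"
bad = b"#SampleID BarcodeSequence\r\nAP1E GTAACCA\r\nAP25E\tGTACCCA \r\n"
print(check_barcode_file(good))
for problem in check_barcode_file(bad):
    print(problem)
```

To check a real file, read it in binary mode, for example `check_barcode_file(open("mapping2_bcfile_new.txt", "rb").read())`.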

**smitra** · 01-25-2016, 10:07 AM

Yes, re-creating the barcode file works. There must have been some problem with the tabs.
Now I get a more realistic result:

N85567:testdata mitras$ cat test_out_R1.fastq | fastx_barcode_splitter.pl --bcfile mapping2_bcfile_new.txt --prefix /Volumes/Promise\ Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/ --bol --mismatches 0
Barcode Count Location
AP1E 63 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP1E
AP25E 57 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP25E
AP5D 39 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP5D
AP8C 27 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP8C
P29F 40 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P29F
P30N 40 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P30N
P31B 38 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P31B
P35C 57 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P35C
V2A 37 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/V2A
V3J 25 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/V3J
unmatched 577 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/unmatched

Thank you so very much

**smitra** · 01-26-2016, 03:05 AM

Now that the barcode splitter is matching reads, I tried different combinations. Keeping the leading N in the reads and running with --mismatches 0 also works.
But the problem is that fewer than half of my reads match.

N85567:testdata mitras$ cat test_R1.fastq | fastx_barcode_splitter.pl --bcfile mapping2_bcfile.txt --prefix /Volumes/Promise\ Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/ --suffix "_R1.fastq" --bol --mismatches 0
Barcode Count Location
AP1E 61 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP1E_R1.fastq
AP25E 55 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP25E_R1.fastq
AP5D 37 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP5D_R1.fastq
AP8C 27 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/AP8C_R1.fastq
P29F 40 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P29F_R1.fastq
P30N 40 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P30N_R1.fastq
P31B 35 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P31B_R1.fastq
P35C 55 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/P35C_R1.fastq
V2A 36 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/V2A_R1.fastq
V3J 25 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/V3J_R1.fastq
unmatched 589 /Volumes/Promise Pegasus/IFR/Cristina/Mr.DNA_2016/fastq_files/testdata/unmatched_R1.fastq
total 1000

So having 589/1000 reads unmatched is not acceptable.
The few reads I checked look fine.
There is no reason for the first few reads not to be matched (the barcode file is copied again below; the first two reads should match AP25E and AP1E).

Code:

N85567:unmatched_try mitras$ less unmatched.fastq 

@M02542:124:000000000-AKFBJ:1:1101:13841:1000 1:N:0:5
NGTACCCAAGGGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACANNCNNGTCGAACGGTAGCNCAGAGAGCTTGCTCTNGGNTGACGAGTGGCGGACGGGNGANTAATGTCTGGGAAACTGCCCGATGGAGGGGGATANCTACTGGANANNGNNGCTAATACCGCATAACGNCGCAAGACCAAAGAGGGNGANNTCAGGGCCTCTTGNCATCGGATGNNCCCAGATGGGATNGGCTTGTAGGTGAGGTAAGNGCTCACGCNGGCGACGATCCCTAGCTTGGNNGNGAGG
+
#8ABCFGGGGGGGGEEGGGGGGGG<FGGGFFGFGFGFGGEG@FGEEGGCFGGGGG?##:##6:CFFGGGDG<CG#:CCFFGEGGGGFAFG#:<#:BBFF7FFGDGGGGGGGD#8+#+:BFGGGGGGGCFFGDGG<FGGGECCGDEGGGF@#611:D,>>#6##6##66<1CF@7FFFGEGF7E#41=8=EGFFG7*?CF>>#22##2*2;@;8C8CFC<#/2AC=E*:5##/2:CFCG+8**+#*1*1552<+*+0+8D6D4+#1**)**)*#*15/*//7>5:5<.*,*)0)##1#..73
@M02542:124:000000000-AKFBJ:1:1101:12174:1002 1:N:0:5
NGTAACCAAGGGTTTGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACANNCNNGTCGAGGGGCAGCATTTCAGTTTGCTTGCNAANTGGAGATGGCGACCGGCGNACNGGTGAGTAACACGTATCCAACCTGCCGATAACTCNGGGATAGCNTNNCNNAAGAAAGATTGATACCCNATGGTATAATCAGACCGNATGGTCTTATTATTAAANAATTTCGGTNNTCGATGGGGATGNGTTCCATTAGGCAGTTGGTGTGTTAATGNCGCACCAAACCTTCCTGTGANNGNGTTT
+
#8ACCGGGGGGGCFGGGGGGGGGGGGGGGGGGGFGGGGDGGGGGGGGGFGGGGGGG##:##6:CFGFDEGGGGDGGGFGGGFGGGGGGGG#:C#66=,CFFFGGGG@FGEE7#++#:BBFFGGGFCFGGGGGGCGDGGGFGGGGGGGGC=#8@<<<FGG#5##8##86DCF<FCCC:BFCFFF#6>F>FGG92;@CFFGF@#116*=CF<CG?@CFFFG#3;5375:CG##212**<5C5/::#11:91A>+<>C6CE<FC:*****0:FB<#1*)//75<F30762*-2)**##1#0)0.

#SampleID BarcodeSequence
AP1E CGTAACCA
AP25E CGTACCCA
AP5D CGTAAGAA
AP8C CGTAGATA
P29F CGTAGGCT
P30N CGTATTCA
P31B CGTCAAGA
P35C CGTATTTC
V2A CGTCCAGG
V3J CGTCACAG

I have no idea why this is getting so messy, and getting only about half of the reads matched is not good enough. So I am keeping this post alive in case anyone can help further. Thanks a lot. smitra

**GenoMax** · 01-26-2016, 04:10 AM

Getting rid of the first base from all reads (hopefully that is consistently N, have you spot checked?) and then removing the first C from your barcode file may be the way to go.

That said, the example reads you posted above have a lot of Ns in the middle of the reads (which is a bad sign, likely indicative of low nucleotide diversity during those cycles). Are you going to be able to use this data if it does get demultiplexed?

If these are amplicons and such you should look into primer schemes that stagger the start by adding random bases so as to overcome the low nucleotide diversity issue.
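To see which cycles are affected, the fraction of N calls per read position can be tabulated. This is a rough sketch only: the function name `n_fraction_by_position` and the three short reads are made up for illustration, and real reads would come from the sequence lines of the FASTQ file.

```python
from collections import Counter


def n_fraction_by_position(sequences):
    """Return a list where entry i is the fraction of reads with 'N' at position i.

    Reads shorter than position i are left out of that position's denominator.
    """
    n_counts = Counter()
    totals = Counter()
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [n_counts[i] / totals[i] for i in range(len(totals))]


# Made-up reads: N at the first cycle and scattered Ns near the end.
reads = ["NGTACCCAAGNN", "NGTAACCAAGNG", "NGTAAGAAAGTN"]
for cycle, frac in enumerate(n_fraction_by_position(reads), start=1):
    print(f"cycle {cycle:3d}  N fraction {frac:.2f}")
```

Cycles where the fraction is close to 1 across the whole run point to a base-calling problem at that cycle rather than to a problem with individual reads.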

**smitra** · 01-26-2016, 06:15 AM

Dear GenoMax,
Thanks. Yes, I have tried getting rid of the first base from all reads (yes, it is consistently N) and also keeping the reads as they are. But in both cases my matching success rate is about 50%. Yes, I know the reads have many Ns in the middle, but those reads will be discarded later in the QC protocol. So at this first stage I am trying to get as many reads as I can for each sample. Do you think I should try allowing more mismatches or using the --partial option? The first reads I copied should still be matched.
I am getting really confused by this tool.
Thanks,
mitra
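Before raising the mismatch limit, it may help to tabulate how far the start of each unmatched read is from its closest barcode. The sketch below uses a plain Hamming distance in which N counts as a mismatch; that is an assumption made for illustration and is not a statement about how fastx_barcode_splitter.pl scores N. The names `hamming` and `classify_prefixes` and the four short reads are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify_prefixes(sequences, barcodes):
    """Count reads by (closest sample, mismatches) at the start of the read.

    barcodes maps sample ID -> barcode sequence. Each read prefix is compared
    with every barcode over that barcode's length; ties go to the first barcode.
    """
    tally = Counter()
    for seq in sequences:
        best = None
        for sample, barcode in barcodes.items():
            prefix = seq[:len(barcode)]
            if len(prefix) < len(barcode):
                continue  # read too short to compare
            dist = hamming(prefix, barcode)
            if best is None or dist < best[1]:
                best = (sample, dist)
        tally[best if best is not None else ("none", -1)] += 1
    return tally


# Two barcodes from the thread and four made-up short reads.
barcodes = {"AP1E": "CGTAACCA", "AP25E": "CGTACCCA"}
reads = ["NGTACCCAAGGG", "NGTAACCAAGGG", "CGTAACCAAGGG", "TTTTTTTTTTTT"]
for (sample, dist), count in sorted(classify_prefixes(reads, barcodes).items()):
    print(sample, dist, count)
```

If most unmatched reads sit at distance 1 from a barcode, a small mismatch allowance is worth trying; if they sit far from every barcode, the reads probably do not start with a barcode at all and no setting will rescue them.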
