Seqanswers Leaderboard Ad

**Brian Bushnell** · 09-24-2014, 10:16 AM

BBDuk might work for that. It can bin (or trim) reads by the presence and absence of specific kmers, like this:

bbduk.sh in=reads.fq outm=matching.fq out=unmatching.fq literal=ATGTTACGTCT k=11

However, it looks at all of the kmers in a read, not just the first one. Does this sequence ever occur in the middle of the reads, and if so, what would you want to do with those reads?

**vas72985** · 09-24-2014, 10:17 AM

The sequence most likely always occurs at the beginning of the read, but I suppose if something were to be slightly off with the prep, it could occur later in the read. In that case I would also like to pull out those reads. So if I understand correctly, this might work for that purpose?

**vas72985** · 09-24-2014, 10:21 AM

However, does BBDuk allow for paired data? Ie, if the kmer is in read 1, will it allow for isolation of read 1 reads containing the kmer but also of read 2 pairs for those reads it identifies?)

**Brian Bushnell** · 09-24-2014, 10:55 AM

1) Yes, it will work perfectly, in this case.
2) BBDuk always keeps pairs together, as long as it knows the input is paired. For twin files, the command would be:

bbduk.sh in1=reads1.fq in2=reads2.fq outm1=matching1.fq outm2=matching2.fq out1=unmatching1.fq out2=unmatching2.fq literal=ATGTTACGTCT k=11

You can later trim the reads with the "ktrim=l" flag.

**vas72985** · 09-24-2014, 11:28 AM

So I tried this on a very small test data set where I artificially inserted a specific 12mer (GACCAGCTAGTG) and it found all of the ones that I artificially inserted (as well as one that I didn't realize was there to begin with), but it also output a few read pairs as matches that look like they shouldn't belong. For example the read pair below:

@IRIS:7:32:32:1772#0/1
AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA
+
aaaaaaaaaa`abab`a^aabaaa`ab`a`aaa`]a

@IRIS:7:32:32:1772#0/2
GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
+
abbbaab^aaa``_aaa]`^_Z\X`W]^_a_TQ[]Z

Any ideas why it would be making some improper calls?

**vas72985** · 09-24-2014, 11:33 AM

Basically it found all 11 sequences that I know match the 12mer, but it also pulled out an additional 9 sequences that I have no idea why they are being called matching.

**Brian Bushnell** · 09-24-2014, 12:45 PM

Oh - by default, it looks for both a kmer AND its reverse compliment, and ignores the middle letter of the kmer to increase sensitivity. To disable these, add these flags:

rcomp=f mm=f

(where rcomp means 'look for reverse-compliments of kmers' and mm means 'mask middle').

In this case, reverse-compliment of GACCAGCTAGTG = CACTAGCTGGTC, and:

Code:

                     [B]CACTAG[COLOR="Red"]C[/COLOR]TGGTC[/B]
GAAGAAACCTCACAAGACTTT[B]CACTAGATGGTC[/B]AGA

...the middle base is masked. So it matches read 2.

**vas72985** · 09-24-2014, 12:58 PM

Ah, brilliant. Now it works like a charm. Thanks for the help. I'll give it a try on my actual dataset whenever I get it back. Now if only you could make that happen faster

**vas72985** · 09-24-2014, 02:08 PM

Now I may be getting greedy, but is there an option that would allow me to set a threshold for mismatches between my sequence and kmer. For example I know my kmer is exact, but it's possible that I would want to allow 1 or 2 mismatches from my kmer in sequences and still have them be called "matched". Is there an option for this? It wasn't immediately obvious to me looking at the usage.

Thanks

**Brian Bushnell** · 09-24-2014, 02:38 PM

Yes! It is possible. You can set "hdist=1" for one mismatch or "hdist=2" for 2 (that stands for Hamming distance). You can also allow indels but that shouldn't be necessary with Illumina data.

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Pulling out paired reads containing a specific sequence in one pair

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News