**Brian Bushnell** · 09-24-2014, 10:16 AM

BBDuk might work for that. It can bin (or trim) reads by the presence and absence of specific kmers, like this:

bbduk.sh in=reads.fq outm=matching.fq out=unmatching.fq literal=ATGTTACGTCT k=11

However, it looks at all of the kmers in a read, not just the first one. Does this sequence ever occur in the middle of the reads, and if so, what would you want to do with those reads?
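To make "all of the kmers in a read" concrete, here is a minimal Python sketch of the idea. It is not BBDuk's code; the function name and the two example reads are made up for illustration, and only exact forward matches are considered.

```python
def kmer_hit_positions(read, literal, k):
    """Return every 0-based start position in `read` whose kmer also occurs in `literal`."""
    # Collect the kmers of the query sequence once.
    wanted = {literal[i:i + k] for i in range(len(literal) - k + 1)}
    # Slide a window of length k over the whole read, not just its start.
    return [i for i in range(len(read) - k + 1) if read[i:i + k] in wanted]


LITERAL = "ATGTTACGTCT"  # the literal from the command above

# Hypothetical reads: one with the sequence at the start, one with it further in.
print(kmer_hit_positions("ATGTTACGTCTGGGGACCA", LITERAL, 11))
print(kmer_hit_positions("GGGGACCAATGTTACGTCT", LITERAL, 11))
```

Both reads would count as matching, because a hit anywhere in the read is enough.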

**vas72985** · 09-24-2014, 10:17 AM

The sequence most likely always occurs at the beginning of the read, but I suppose if something were to be slightly off with the prep, it could occur later in the read. In that case I would also like to pull out those reads. So if I understand correctly, this might work for that purpose?

**vas72985** · 09-24-2014, 10:21 AM

However, does BBDuk allow for paired data? I.e., if the kmer is in read 1, will it pull out the read 1 reads containing the kmer and also the read 2 mates of those reads?

**Brian Bushnell** · 09-24-2014, 10:55 AM

1) Yes, it will work perfectly, in this case.
2) BBDuk always keeps pairs together, as long as it knows the input is paired. For twin files, the command would be:

bbduk.sh in1=reads1.fq in2=reads2.fq outm1=matching1.fq outm2=matching2.fq out1=unmatching1.fq out2=unmatching2.fq literal=ATGTTACGTCT k=11

You can later trim the reads with the "ktrim=l" flag.
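As a rough illustration of what "keeps pairs together" means, the sketch below bins whole pairs rather than single reads. It assumes a pair goes to the matching bin when either mate contains the sequence; the function name and the reads are hypothetical, and the matching is a plain exact substring test rather than BBDuk's kmer logic.

```python
def bin_pairs(pairs, literal):
    """Split (read1, read2) tuples into matching and unmatching pairs.

    Assumption for this sketch: a hit in either mate sends the whole pair
    to the matching bin, so mates are never separated.
    """
    matching, unmatching = [], []
    for read1, read2 in pairs:
        if literal in read1 or literal in read2:
            matching.append((read1, read2))
        else:
            unmatching.append((read1, read2))
    return matching, unmatching


# Hypothetical pairs: the first has the sequence in read 1 only.
pairs = [
    ("ATGTTACGTCTGGGG", "CCCCCCCCCCCCCCC"),
    ("GGGGGGGGGGGGGGG", "TTTTTTTTTTTTTTT"),
]
matching, unmatching = bin_pairs(pairs, "ATGTTACGTCT")
print(len(matching), len(unmatching))
```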

**vas72985** · 09-24-2014, 11:28 AM

So I tried this on a very small test data set where I artificially inserted a specific 12mer (GACCAGCTAGTG) and it found all of the ones that I artificially inserted (as well as one that I didn't realize was there to begin with), but it also output a few read pairs as matches that look like they shouldn't belong. For example the read pair below:

@IRIS:7:32:32:1772#0/1
AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA
+
aaaaaaaaaa`abab`a^aabaaa`ab`a`aaa`]a

@IRIS:7:32:32:1772#0/2
GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
+
abbbaab^aaa``_aaa]`^_Z\X`W]^_a_TQ[]Z

Any ideas why it would be making some improper calls?

**vas72985** · 09-24-2014, 11:33 AM

Basically it found all 11 sequences that I know match the 12mer, but it also pulled out an additional 9 sequences, and I have no idea why those are being called matches.

**Brian Bushnell** · 09-24-2014, 12:45 PM

Oh - by default, it looks for both a kmer AND its reverse complement, and ignores the middle letter of the kmer to increase sensitivity. To disable these, add these flags:

rcomp=f mm=f

(where rcomp means 'look for reverse-complements of kmers' and mm means 'mask middle').

In this case, reverse-complement of GACCAGCTAGTG = CACTAGCTGGTC, and:

Code:

                     CACTAGCTGGTC
GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
                           ^

...the middle base is masked. So it matches read 2.
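The same reasoning can be checked with a small Python sketch. This is a simplified model, not BBDuk's implementation: it assumes k=11 as in the commands earlier in the thread, treats "mask middle" as ignoring the base at index k//2, and uses hypothetical function names. The `rcomp` and `mm` keyword arguments only mirror the flags described above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_middle(kmer):
    """Replace the middle base with N so it is ignored in comparisons."""
    mid = len(kmer) // 2
    return kmer[:mid] + "N" + kmer[mid + 1:]


def read_matches(read, literal, k=11, rcomp=True, mm=True):
    """True if any kmer of `read` equals a kmer of `literal` under the given options."""
    refs = set(kmers(literal, k))
    if rcomp:
        refs |= set(kmers(reverse_complement(literal), k))
    normalize = mask_middle if mm else (lambda kmer: kmer)
    refs = {normalize(kmer) for kmer in refs}
    return any(normalize(kmer) in refs for kmer in kmers(read, k))


LITERAL = "GACCAGCTAGTG"
READ1 = "AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA"
READ2 = "GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA"

print(reverse_complement(LITERAL))
print(read_matches(READ2, LITERAL))                        # defaults
print(read_matches(READ2, LITERAL, rcomp=False, mm=False))  # both disabled
```

Under these assumptions, read 2 of the posted pair matches with the defaults and stops matching once both options are turned off, while read 1 does not match either way.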

**vas72985** · 09-24-2014, 12:58 PM

Ah, brilliant. Now it works like a charm. Thanks for the help. I'll give it a try on my actual dataset whenever I get it back. Now if only you could make that happen faster.

**vas72985** · 09-24-2014, 02:08 PM

Now I may be getting greedy, but is there an option that would allow me to set a threshold for mismatches between my sequence and the kmer? For example, I know my kmer is exact, but I might want to allow 1 or 2 mismatches to the kmer in the reads and still have them called "matched". Is there an option for this? It wasn't immediately obvious to me from looking at the usage.

Thanks

**Brian Bushnell** · 09-24-2014, 02:38 PM

Yes! It is possible. You can set "hdist=1" for one mismatch or "hdist=2" for 2 (that stands for Hamming distance). You can also allow indels but that shouldn't be necessary with Illumina data.
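For anyone unfamiliar with the term, Hamming distance is just the number of positions at which two equal-length sequences differ. The Python sketch below shows what a mismatch threshold means in practice; the function names are hypothetical, the `hdist` parameter only borrows the flag's name, and BBDuk's internal approach may well be different.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches_within(read, kmer, hdist):
    """True if some window of `read` is within `hdist` mismatches of `kmer`."""
    k = len(kmer)
    return any(
        hamming(read[i:i + k], kmer) <= hdist
        for i in range(len(read) - k + 1)
    )


READ2 = "GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA"  # read 2 from earlier in the thread
RC_KMER = "CACTAGCTGGTC"  # reverse complement of GACCAGCTAGTG

print(hamming("CACTAGCTGGTC", "CACTAGATGGTC"))
print(matches_within(READ2, RC_KMER, hdist=0))
print(matches_within(READ2, RC_KMER, hdist=1))
```

With no mismatches allowed the read is not a hit for this kmer, but allowing one mismatch is enough for it to match.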

Pulling out paired reads containing a specific sequence in one pair

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News