SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie2 output fastq of paired reads where at least one in the pair aligns shawn.mek Bioinformatics 0 09-06-2013 11:24 AM
Samtools Flag to identify outward facing (mate-pair not paired end) reads lesander Bioinformatics 5 10-28-2011 01:24 PM
Base pair sequence order in paired reads seqmonkey Illumina/Solexa 2 10-07-2011 03:56 AM
Are the mate-pair reads always shadowed by their friends the paired-end (picture) seb567 Illumina/Solexa 1 07-15-2011 09:49 AM
How to check if reads are properly paired in mate-pair data? genepool_bee Bioinformatics 2 02-22-2011 01:07 AM

Reply
 
Thread Tools
Old 09-24-2014, 10:01 AM   #1
vas72985
Member
 
Location: New Orleans, LA USA

Join Date: Jul 2010
Posts: 20
Default Pulling out paired reads containing a specific sequence in one pair

I am wondering if anyone knows of an easy way (or tool) to take paired Illumina fastq data and isolate only those read pairs that contain a specific sequence (say a 20 or 30mer) as the first 20 (or 30) nucleotides of one of the reads.

I will have a 2x150 MiSeq library where some percentage of the total reads (the percentage that I want) will have one read in the pair that starts with a specific nucleotide sequence. Ideally I would like to collect all read pairs where one read in the pair starts with the specific sequence as well as have that sequence trimmed from the read containing it before I map the pairs to the genome.

Thanks.
vas72985 is offline   Reply With Quote
Old 09-24-2014, 10:16 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBDuk might work for that. It can bin (or trim) reads by the presence and absence of specific kmers, like this:

bbduk.sh in=reads.fq outm=matching.fq out=unmatching.fq literal=ATGTTACGTCT k=11

However, it looks at all of the kmers in a read, not just the first one. Does this sequence ever occur in the middle of the reads, and if so, what would you want to do with those reads?
Brian Bushnell is offline   Reply With Quote
Old 09-24-2014, 10:17 AM   #3
vas72985
Member
 
Location: New Orleans, LA USA

Join Date: Jul 2010
Posts: 20
Default

The sequence most likely always occurs at the beginning of the read, but I suppose if something were to be slightly off with the prep, it could occur later in the read. In that case I would also like to pull out those reads. So if I understand correctly, this might work for that purpose?
vas72985 is offline   Reply With Quote
Old 09-24-2014, 10:21 AM   #4
vas72985
Member
 
Location: New Orleans, LA USA

Join Date: Jul 2010
Posts: 20
Default

However, does BBDuk allow for paired data? Ie, if the kmer is in read 1, will it allow for isolation of read 1 reads containing the kmer but also of read 2 pairs for those reads it identifies?)
vas72985 is offline   Reply With Quote
Old 09-24-2014, 10:55 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

1) Yes, it will work perfectly, in this case.
2) BBDuk always keeps pairs together, as long as it knows the input is paired. For twin files, the command would be:

bbduk.sh in1=reads1.fq in2=reads2.fq outm1=matching1.fq outm2=matching2.fq out1=unmatching1.fq out2=unmatching2.fq literal=ATGTTACGTCT k=11

You can later trim the reads with the "ktrim=l" flag.
Brian Bushnell is offline   Reply With Quote
Old 09-24-2014, 11:28 AM   #6
vas72985
Member
 
Location: New Orleans, LA USA

Join Date: Jul 2010
Posts: 20
Default

So I tried this on a very small test data set where I artificially inserted a specific 12mer (GACCAGCTAGTG) and it found all of the ones that I artificially inserted (as well as one that I didn't realize was there to begin with), but it also output a few read pairs as matches that look like they shouldn't belong. For example the read pair below:

@IRIS:7:32:32:1772#0/1
AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA
+
aaaaaaaaaa`abab`a^aabaaa`ab`a`aaa`]a

@IRIS:7:32:32:1772#0/2
GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
+
abbbaab^aaa``_aaa]`^_Z\X`W]^_a_TQ[]Z


Any ideas why it would be making some improper calls?
vas72985 is offline   Reply With Quote
Old 09-24-2014, 11:33 AM   #7
vas72985
Member
 
Location: New Orleans, LA USA

Join Date: Jul 2010
Posts: 20
Default

Basically it found all 11 sequences that I know match the 12mer, but it also pulled out an additional 9 sequences that I have no idea why they are being called matching.
vas72985 is offline   Reply With Quote
Old 09-24-2014, 12:45 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh - by default, it looks for both a kmer AND its reverse compliment, and ignores the middle letter of the kmer to increase sensitivity. To disable these, add these flags:

rcomp=f mm=f

(where rcomp means 'look for reverse-compliments of kmers' and mm means 'mask middle').

In this case, reverse-compliment of GACCAGCTAGTG = CACTAGCTGGTC, and:
Code:
                     CACTAGCTGGTC
GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
...the middle base is masked. So it matches read 2.
Brian Bushnell is offline   Reply With Quote
Old 09-24-2014, 12:58 PM   #9
vas72985
Member
 
Location: New Orleans, LA USA

Join Date: Jul 2010
Posts: 20
Default

Ah, brilliant. Now it works like a charm. Thanks for the help. I'll give it a try on my actual dataset whenever I get it back. Now if only you could make that happen faster
vas72985 is offline   Reply With Quote
Old 09-24-2014, 02:08 PM   #10
vas72985
Member
 
Location: New Orleans, LA USA

Join Date: Jul 2010
Posts: 20
Default

Now I may be getting greedy, but is there an option that would allow me to set a threshold for mismatches between my sequence and kmer. For example I know my kmer is exact, but it's possible that I would want to allow 1 or 2 mismatches from my kmer in sequences and still have them be called "matched". Is there an option for this? It wasn't immediately obvious to me looking at the usage.

Thanks
vas72985 is offline   Reply With Quote
Old 09-24-2014, 02:38 PM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yes! It is possible. You can set "hdist=1" for one mismatch or "hdist=2" for 2 (that stands for Hamming distance). You can also allow indels but that shouldn't be necessary with Illumina data.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
fastq, read trimming

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:14 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO