SEQanswers

Old 01-09-2020, 01:40 PM   #1
unionicola
Junior Member
 
Location: Wisconsin

Join Date: Feb 2009
Posts: 4
bbduk and removing adapters of varying length

I am analyzing some published Tn-seq data. There appears to be residual transposon sequence in the reads, preventing alignment. Unfortunately, this sequence is of variable length. Here are some example reads; the transposon sequence is the leading part of each read, ending in AGTCATGC:

ATTCCGCTCTTCCGATCTAGTCATGCGCGGCCGCATAACATAACCGGTTGGATGATAAGTCCCCGGTCTATAT
ATTCTTCCCTACACGACGCTCTTCCGATCTAGTCATGCGCCGGAGCATTAGGTAACAGGTTGGATGATAAGTC
ATGCAGTCATGCAAATGATAACAGGTTGGATGATAAGTCCCCGGTCTATATTGAGAGTAACTACATTTACCGT
ATTCATCATTGCGGCAGTCATGCCTATTGTTCCTGGTGTAACAGGTTGGATGATAAGTCCCCGGTCTATATTG


I want to search for the last 8 bases of the transposon sequence (AGTCATGC) and remove everything to the left of it, while retaining the rest of the read and keeping any read that has no transposon sequence. I've been trying bbduk with the following command:

Code:
bbduk.sh in=$in.fastq literal=AGTCATGC ktrim=l k=8 rcomp=f out=$out.fastq
But this seems to result in nearly every read being removed (the read count drops from over 4 million to about 20,000).

Can anyone help me with this issue? Thanks!
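
To make the intended behavior concrete, here is a minimal awk sketch of the trimming described above. It is not BBDuk and is not meant to replace it: `trim_left_of_motif` is a hypothetical helper, it assumes plain 4-line FASTQ records, it matches the motif exactly (no mismatches, no reverse complement), and it cuts at the leftmost occurrence, which may differ from what BBDuk does when a read contains the k-mer more than once. Like ktrim=l, it removes the motif itself along with everything to its left.

```shell
# Hypothetical helper (not part of BBTools): left-trim each FASTQ record
# through the first exact occurrence of a motif. Reads without the motif
# pass through unchanged. Assumes 4-line FASTQ records on stdin.
trim_left_of_motif() {
  awk -v motif="$1" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 {
      seq = $0
      pos = index(seq, motif)          # 0 when the motif is absent
      cut = (pos > 0) ? pos + length(motif) - 1 : 0
    }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
      print header
      print substr(seq, cut + 1)       # bases after the motif
      print plus
      print substr($0, cut + 1)        # matching quality values
    }'
}

# Tiny made-up example: r1 contains the motif, r2 does not.
printf '@r1\nAAAGTCATGCTTT\n+\nIIIIIIIIIIJJJ\n@r2\nCCCC\n+\nIIII\n' |
  trim_left_of_motif AGTCATGC
```

On this made-up input, r1 comes out as TTT with qualities JJJ, and r2 is left untouched. Comparing a few reads from this kind of sketch against the BBDuk output is one way to confirm the tool is doing what is expected.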
Old 01-09-2020, 10:45 PM   #2
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521

I just tried your code on a test set of three reads, using a random k-mer from the 3rd read (GTCTTCGA) as the literal:

Code:
@NGSNJ-086:222:GW191226409th:1:1103:5556:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
ACTCAAAAACTTTGCTTTCTCAACATTCACCTAAGTCTACATTAAAATTCTTGCATCTTCTAACTCGTGTTCGCTAAGTAATGAAGCATCGTTTATTCAACACTTTTTTTTTACCTAATGGAACTAATTAATTATGTTTATCTTTCTTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:F:FFFF:FFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5900:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
GTCCAACACATCAAGGAGTGATTGATGTGAAATGGGTATTTAGGAACAAAAAGGATGAAAATGCTGAAATAATCGGAAATAAGGCAAGATTAGTTGCCAAAGGTTACTGTCAACAAGAATGGATAGACTATGATGAGACCTATGCTCCAG
+
FFFFFFFFFFFFF,FF:FFFF::FFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5990:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
GTGTACGAATTTCACATTATCGAAAATCTCATTTGTTTGGAAACATTTCCTCTTGGTGTCTTCGAACATGTAAGAAATGTAAAATTCATACCAAAATCGGGTAAGAATTTCATAATTATATAAGTATATGGTTATTGGTATGAATATAAT
+
FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF
bbduk.sh in=test.fq literal=GTCTTCGA ktrim=l k=8 rcomp=f out=test_out.fastq
Input: 3 reads 450 bases.
KTrimmed: 1 reads (33.33%) 65 bases (14.44%)
Total Removed: 0 reads (0.00%) 65 bases (14.44%)
Result: 3 reads (100.00%) 385 bases (85.56%)

cat test_out.fastq
Code:
@NGSNJ-086:222:GW191226409th:1:1103:5556:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
ACTCAAAAACTTTGCTTTCTCAACATTCACCTAAGTCTACATTAAAATTCTTGCATCTTCTAACTCGTGTTCGCTAAGTAATGAAGCATCGTTTATTCAACACTTTTTTTTTACCTAATGGAACTAATTAATTATGTTTATCTTTCTTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:F:FFFF:FFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5900:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
GTCCAACACATCAAGGAGTGATTGATGTGAAATGGGTATTTAGGAACAAAAAGGATGAAAATGCTGAAATAATCGGAAATAAGGCAAGATTAGTTGCCAAAGGTTACTGTCAACAAGAATGGATAGACTATGATGAGACCTATGCTCCAG
+
FFFFFFFFFFFFF,FF:FFFF::FFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5990:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
ACATGTAAGAAATGTAAAATTCATACCAAAATCGGGTAAGAATTTCATAATTATATAAGTATATGGTTATTGGTATGAATATAAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF
So it worked on my set. Can you set outm= and look at what is being removed?

You could also try kmask=# to see whether it is finding the desired k-mer.
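
Independently of BBDuk, a rough cross-check is to count how many reads contain the k-mer at all. The sketch below is only an illustration: `count_motif_reads` is a hypothetical helper, it assumes 4-line FASTQ records, and it looks for exact forward-strand matches only.

```shell
# Hypothetical helper (not part of BBTools): report how many reads in a
# 4-line FASTQ stream contain an exact copy of the motif.
count_motif_reads() {
  awk -v motif="$1" '
    NR % 4 == 2 { total++; if (index($0, motif) > 0) hits++ }
    END { printf "%d of %d reads contain %s\n", hits, total, motif }'
}

# Tiny made-up example: one of the two reads carries the motif.
printf '@r1\nAAAGTCATGCTTT\n+\nIIIIIIIIIIJJJ\n@r2\nCCCC\n+\nIIII\n' |
  count_motif_reads AGTCATGC
```

If that count is far from the number of reads BBDuk reports as KTrimmed, the k-mer settings deserve a second look before the trimmed output is trusted.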
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 01-10-2020, 08:08 AM   #3
unionicola
Junior Member
 
Location: Wisconsin

Join Date: Feb 2009
Posts: 4

Thank you for your help.

Previously, I was getting the following results summary:

Code:
Input:                  	29978820 reads 		2188453860 bases.
KTrimmed:               	135110 reads (0.45%) 	3873797 bases (0.18%)
Total Removed:          29968470 reads (99.97%) 	2188332397 bases (99.99%)
Result:                 	10350 reads (0.03%) 	121463 bases (0.01%)
I'm not sure what I was doing differently, but this morning I repeated the search using the following command (I need to search for and remove multiple 8-mers):

Code:
bbduk.sh in=input.fastq literal=AGTCATGC,ACGTACTG,TGACTGCA,TCGACGAT,CTAGCATG,GACTGTAC ktrim=l k=8 rcomp=f outm=removed.fastq out=trimmed.fastq
It gave me the following results:

Code:
Input:                  	29978820 reads 		2188453860 bases.
KTrimmed:               	135110 reads (0.45%) 	3873797 bases (0.18%)
Total Removed:          9706 reads (0.03%) 	3873797 bases (0.18%)
Result:                 	29969114 reads (99.97%) 	2184580063 bases (99.82%)
When I look in removed.fastq, I only see the leftmost part of the reads (from the indicated 8-mer). So it looks like it is working now, but I'm confused as to why/how.

Nonetheless, at least I'm getting the results I want! Thanks for your help again!
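
As a sanity check on the two summaries above, Result should equal Input minus Total Removed. The sketch below just redoes that arithmetic with the read counts quoted in this post; `check_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: check that Result = Input - Total Removed.
# Arguments: input_reads removed_reads result_reads
check_summary() {
  if [ $(( $1 - $2 )) -eq "$3" ]; then
    echo consistent
  else
    echo inconsistent
  fi
}

check_summary 29978820 29968470 10350     # first run
check_summary 29978820 9706 29969114      # second run
```

Both runs are internally consistent, and both trimmed the same 135110 reads; what differs is how many reads were counted as removed. One possible explanation for the 9706 reads removed in the second run, offered only as a guess, is that they fell below BBDuk's minimum length after trimming (the minlength parameter), which would be worth checking against the BBDuk documentation.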
Tags
bbduk, bbmap, fastq, tn-seq
