SEQanswers

bbduk and removing adapters of varying length (http://seqanswers.com/forums/showthread.php?t=92446)

unionicola 01-09-2020 02:40 PM

bbduk and removing adapters of varying length
 
I am analyzing some published Tn-seq data. There appears to be residual transposon sequence in the reads, which prevents alignment. Unfortunately, this sequence is of variable length. Here are some example reads; the transposon sequence is underlined:

ATTCCGCTCTTCCGATCTAGTCATGCGCGGCCGCATAACATAACCGGTTGGATGATAAGTCCCCGGTCTATAT
ATTCTTCCCTACACGACGCTCTTCCGATCTAGTCATGCGCCGGAGCATTAGGTAACAGGTTGGATGATAAGTC
ATGCAGTCATGCAAATGATAACAGGTTGGATGATAAGTCCCCGGTCTATATTGAGAGTAACTACATTTACCGT
ATTCATCATTGCGGCAGTCATGCCTATTGTTCCTGGTGTAACAGGTTGGATGATAAGTCCCCGGTCTATATTG


I want to search for the last 8 bases of the transposon sequence (AGTCATGC) and remove everything to the left of it, while keeping the rest of each read and also keeping any read that contains no transposon sequence. I've been trying bbduk with the following command:

Code:

bbduk.sh in=$in.fastq literal=AGTCATGC ktrim=l k=8 rcomp=f out=$out.fastq

But this seems to remove nearly every read (the read count drops from over 4 million to about 20,000).

Can anyone help me with this issue? Thanks!
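
For a sanity check on a handful of reads, the trimming described above can be sketched in plain shell, independently of bbduk. This is only an illustration: `trim_left_of_kmer` is a hypothetical helper (not part of BBTools), it assumes unwrapped 4-line FASTQ records, it matches the k-mer exactly, and it cuts at the first occurrence in the read, which may differ from what bbduk does when a read contains several matches.

```shell
# Hypothetical helper: left-trim FASTQ records at the first exact occurrence
# of a k-mer. Reads 4-line FASTQ on stdin, writes FASTQ on stdout.
# The k-mer itself and everything to its left are removed from both the
# sequence and the quality string; reads without the k-mer pass through unchanged.
trim_left_of_kmer() {
  awk -v kmer="$1" '
    NR % 4 == 1 { header = $0; next }
    NR % 4 == 2 { seq = $0; next }
    NR % 4 == 3 { plus = $0; next }
    {
      qual = $0
      pos = index(seq, kmer)          # 1-based start of the match, 0 if absent
      if (pos > 0) {
        keep = pos + length(kmer)     # first base to keep
        seq = substr(seq, keep)
        qual = substr(qual, keep)
      }
      print header; print seq; print plus; print qual
    }'
}
```

Usage would look like `trim_left_of_kmer AGTCATGC < subset.fastq > subset_trimmed.fastq`, where `subset.fastq` is a placeholder for a small sample of the input.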

SNPsaurus 01-09-2020 11:45 PM

I just tried your command on a test set, using a random kmer from the 3rd read (GTCTTCGA, shown in bold):

Code:

@NGSNJ-086:222:GW191226409th:1:1103:5556:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
ACTCAAAAACTTTGCTTTCTCAACATTCACCTAAGTCTACATTAAAATTCTTGCATCTTCTAACTCGTGTTCGCTAAGTAATGAAGCATCGTTTATTCAACACTTTTTTTTTACCTAATGGAACTAATTAATTATGTTTATCTTTCTTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:F:FFFF:FFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5900:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
GTCCAACACATCAAGGAGTGATTGATGTGAAATGGGTATTTAGGAACAAAAAGGATGAAAATGCTGAAATAATCGGAAATAAGGCAAGATTAGTTGCCAAAGGTTACTGTCAACAAGAATGGATAGACTATGATGAGACCTATGCTCCAG
+
FFFFFFFFFFFFF,FF:FFFF::FFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5990:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
GTGTACGAATTTCACATTATCGAAAATCTCATTTGTTTGGAAACATTTCCTCTTGGTGTCTTCGAACATGTAAGAAATGTAAAATTCATACCAAAATCGGGTAAGAATTTCATAATTATATAAGTATATGGTTATTGGTATGAATATAAT
+
FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF

bbduk.sh in=test.fq literal=GTCTTCGA ktrim=l k=8 rcomp=f out=test_out.fastq
Input: 3 reads 450 bases.
KTrimmed: 1 reads (33.33%) 65 bases (14.44%)
Total Removed: 0 reads (0.00%) 65 bases (14.44%)
Result: 3 reads (100.00%) 385 bases (85.56%)

cat test_out.fastq
Code:

@NGSNJ-086:222:GW191226409th:1:1103:5556:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
ACTCAAAAACTTTGCTTTCTCAACATTCACCTAAGTCTACATTAAAATTCTTGCATCTTCTAACTCGTGTTCGCTAAGTAATGAAGCATCGTTTATTCAACACTTTTTTTTTACCTAATGGAACTAATTAATTATGTTTATCTTTCTTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:F:FFFF:FFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5900:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
GTCCAACACATCAAGGAGTGATTGATGTGAAATGGGTATTTAGGAACAAAAAGGATGAAAATGCTGAAATAATCGGAAATAAGGCAAGATTAGTTGCCAAAGGTTACTGTCAACAAGAATGGATAGACTATGATGAGACCTATGCTCCAG
+
FFFFFFFFFFFFF,FF:FFFF::FFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF
@NGSNJ-086:222:GW191226409th:1:1103:5990:31187 1:N:0:GAAGCGGCAC+CGGCTCTACT
ACATGTAAGAAATGTAAAATTCATACCAAAATCGGGTAAGAATTTCATAATTATATAAGTATATGGTTATTGGTATGAATATAAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF

So it worked on my set. Can you set outm= and look at what is being removed?

You could also try kmask=# to see whether it is finding the desired kmer.
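
An independent count of how many reads contain the 8-mer at all can also show whether the KTrimmed figure is plausible. The sketch below is not a bbduk feature: `count_reads_with_kmer` is a hypothetical helper that assumes unwrapped 4-line FASTQ and exact, forward-strand matches only (comparable in spirit to rcomp=f).

```shell
# Hypothetical helper: count FASTQ reads whose sequence line contains the k-mer.
# Only line 2 of each 4-line record is inspected, so quality strings that
# happen to spell out the k-mer are not counted.
count_reads_with_kmer() {
  awk -v kmer="$1" '
    NR % 4 == 2 && index($0, kmer) > 0 { n++ }
    END { print n + 0 }               # n + 0 prints 0 when nothing matched
  '
}
```

For example, `count_reads_with_kmer AGTCATGC < input.fastq` prints a single number that can be compared with the KTrimmed read count in the bbduk summary.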

unionicola 01-10-2020 09:08 AM

Thank you for your help.

Previously, I was getting the following output results summary:

Code:

Input:            29978820 reads             2188453860 bases.
KTrimmed:         135110 reads (0.45%)       3873797 bases (0.18%)
Total Removed:    29968470 reads (99.97%)    2188332397 bases (99.99%)
Result:           10350 reads (0.03%)        121463 bases (0.01%)

I'm not sure what I was doing differently, but when I repeated the search this morning using the following command (I need to search and remove multiple 8-mers):

Code:

bbduk.sh in=input.fastq literal=AGTCATGC,ACGTACTG,TGACTGCA,TCGACGAT,CTAGCATG,GACTGTAC ktrim=l k=8 rcomp=f outm=removed.fastq out=trimmed.fastq

It gave me the following results:

Code:

Input:            29978820 reads             2188453860 bases.
KTrimmed:         135110 reads (0.45%)       3873797 bases (0.18%)
Total Removed:    9706 reads (0.03%)         3873797 bases (0.18%)
Result:           29969114 reads (99.97%)    2184580063 bases (99.82%)

When I look in removed.fastq, I see only the leftmost part of the reads (up to the indicated 8-mer). So it looks like it is working now, but I'm confused as to why/how.

Nonetheless, at least I'm getting the results I want! Thanks for your help again!
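
To check what removed.fastq actually holds, a read-length tally may help. bbduk has a minlength setting and discards reads that end up shorter than it after trimming; whether that accounts for the 9706 removed reads here is an assumption, not something this thread confirms. `read_length_counts` below is a hypothetical helper that assumes unwrapped 4-line FASTQ.

```shell
# Hypothetical helper: tabulate read lengths in a FASTQ stream.
# Prints one "length<TAB>count" line per distinct length, sorted by length.
read_length_counts() {
  awk '
    NR % 4 == 2 { count[length($0)]++ }   # sequence line of each record
    END { for (len in count) print len "\t" count[len] }
  ' | sort -n
}
```

Running `read_length_counts < removed.fastq` and `read_length_counts < trimmed.fastq` side by side would show whether the removed reads cluster at very short lengths.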

