SEQanswers

Old 09-23-2013, 01:16 AM   #1
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default skewer: A fast and sensitive adapter trimmer for paired-end reads

Hello everyone,

We've implemented a new tool named skewer for adapter trimming. At the moment it is aimed at preprocessing Illumina paired-end and single-end reads. The main features are as follows:
* Full-length adapter sequence trimming for higher specificity.
* Indel errors are allowed when searching for the adapter sequence.
* Very fast: internally it uses a novel local alignment algorithm that has not yet been published. In single-thread mode it can process a pair of compressed files, each about 12 GB uncompressed, in about 30 minutes. It is even faster in multi-thread mode, but the speedup is limited because only the sequence alignment step is parallelized.
* Quality-value aware: it evaluates alignments based on sequence qualities.
* Paired-information aware: it is more accurate when processing paired-end reads.

If you are interested in using it, please download it from
https://sourceforge.net/projects/skewer/

Any feedback or feature requests are welcome!

Last edited by relipmoc; 09-23-2013 at 04:53 PM. Reason: :)
Old 01-14-2014, 06:10 AM   #2
BhariD
Member
 
Location: Georgia

Join Date: Mar 2013
Posts: 15
Default

Hi,

I have to trim full-length adapter sequences with zero mismatches. I do not want to trim reads on any other criterion at this point.

I am using the following command line:
./skewer-0.1.99-linux-x86_64 -x ACACTCTTTCCCTACACGACGCTCTTCCGATCT -y GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -r 0 -d 0 -o exact_trim_15 -t 8 read_1.fastq paired_read2.fastq

Log file includes:
Parameters used:
-- 3' end adapter sequence (-x): ACACTCTTTCCCTACACGACGCTCTTCCGATCT
-- paired 3' end adapter sequence (-y): GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
-- maximum error ratio allowed (-r): 0.000
-- maximum indel error ratio allowed (-d): 0.000
-- minimum read length allowed after trimming (-l): 18
-- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
-- number of concurrent threads (-t): 8
Tue Jan 14 02:18:14 2014 >> started

Tue Jan 14 02:19:33 2014 >> done (78.699s)
47656840 read pairs processed; of these:
0 ( 0.00%) short read pairs filtered out after trimming by size control
0 ( 0.00%) empty read pairs filtered out after trimming by size control
47656840 (100.00%) read pairs available; of these:
3202 ( 0.01%) trimmed read pairs available after processing
47653638 (99.99%) untrimmed read pairs available after processing

Length distribution of reads after trimming:
length count percentage
97 1 0.00%
98 4 0.00%
99 3197 0.01%
100 47653638 99.99%
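As a quick cross-check of the log above, the counts can be added up to see which lengths make up the trimmed pairs. This is only a sketch: the numbers are copied from the log, and the 100 nt input read length is an assumption inferred from the length distribution.

```python
# Counts copied from the skewer log above.
total_pairs = 47656840
trimmed_pairs = 3202
untrimmed_pairs = 47653638

# Length distribution after trimming: read length -> number of read pairs.
length_counts = {97: 1, 98: 4, 99: 3197, 100: 47653638}

# Assumption: the input reads are 100 nt long, so any shorter length means trimming.
read_length = 100
trimmed_from_lengths = sum(
    count for length, count in length_counts.items() if length < read_length
)

# Do the shorter lengths add up to the reported number of trimmed pairs?
print(trimmed_from_lengths == trimmed_pairs)
# Does the length table account for every read pair?
print(sum(length_counts.values()) == total_pairs)
# Share of pairs that were trimmed at all.
print(f"{100 * trimmed_pairs / total_pairs:.4f}% of pairs were trimmed")
```

Under that assumption, the 3197 pairs at length 99 lost a single base, and the remaining 5 trimmed pairs lost 2 or 3 bases.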


My questions are:
1) Given the input parameter settings, were the 3197 trimmed read pairs really trimmed based solely on an exact full-length adapter sequence match? Is there any default parameter that I should be aware of?
2) What is the overlap length for adapter detection in paired-end mode? Is it something like the first 17 bp of the total length? Is there a way I can change this?
3) How can I change the number of mismatches allowed when detecting the adapter region in a read? Say I want to allow only 2 mismatches (instead of zero) in the full-length adapter sequence.
4) How can I specify multiple adapter sequences for the read 1 and read 2 data files?
I would appreciate your help! Thank you!
Old 01-14-2014, 07:31 AM   #3
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

Thank you so much for your feedback!

Quick answers to your questions:
1) The search is based on the exact full-length adapter sequence, but for the 3197 read pairs only the last nucleotides of the reads were identified as the first nucleotides of the corresponding adapter sequences (the length distribution in your log shows these reads at 99 nt). In the current implementation, adapter sequences longer than 64 nt are cut to 64 nt before processing.

2) There's no need to specify the overlap length in paired-end mode. The program knows how to do it correctly.

3) The program only provides an error ratio parameter (-r) and detects the most likely adapter location with a statistical scheme that takes the quality values into account. If you just want to specify the maximum number of mismatches allowed in the full-length adapter sequence, you can use fq2fa.sh to convert the FASTQ files to FASTA files, which drops the quality values, and set the maximum allowed error ratio (-r) to 2/33 ≈ 0.06. For small RNA adapter trimming, the command looks something like this:
$ fq2fa.sh srnaReads.fq | skewer -x TCGTATGCCGTCTTCTGCTTGAAAAAAA -L 30 -r 0.06 -o trimmed -

4) For multiple adapter sequences, you just need to specify two FASTA files that contain the adapter sequences and run something like:
$ skewer -x adapters1.fa -y adapters2.fa flowcell1_lane7_pair1.fastq.gz flowcell1_lane7_pair2.fastq.gz
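The error-ratio arithmetic from answer 3 can be sketched as follows. The helper name `max_error_ratio` is hypothetical and not part of skewer; the thread does not describe how skewer rounds or applies the ratio internally, so this only reproduces the division.

```python
def max_error_ratio(max_mismatches: int, adapter: str) -> float:
    """Return a mismatch budget expressed as a fraction of the adapter length."""
    if not adapter:
        raise ValueError("adapter must not be empty")
    return max_mismatches / len(adapter)


# Adapters taken from the command line in post #2.
adapter_x = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # -x, 33 nt
adapter_y = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"   # -y, 32 nt

ratio_x = max_error_ratio(2, adapter_x)
ratio_y = max_error_ratio(2, adapter_y)

# Two mismatches correspond to slightly different ratios for the two adapters.
print(len(adapter_x), round(ratio_x, 4))
print(len(adapter_y), round(ratio_y, 4))
```

Because the two adapters differ in length, a single -r value corresponds to a slightly different mismatch budget for each of them.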
Attached Files
File Type: zip fq2fa.zip (286 Bytes, 22 views)
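The contents of the attached fq2fa.sh are not shown in this thread. As a rough illustration of what a FASTQ-to-FASTA conversion does, here is a minimal sketch that assumes plain 4-line FASTQ records; the function name `fastq_to_fasta` is hypothetical.

```python
import io
from typing import Iterator, TextIO


def fastq_to_fasta(fastq: TextIO) -> Iterator[str]:
    """Yield FASTA lines for each 4-line FASTQ record, dropping the quality line."""
    while True:
        header = fastq.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = fastq.readline().rstrip("\n")
        fastq.readline()  # '+' separator line, ignored
        fastq.readline()  # quality line, discarded
        if not header.startswith("@"):
            raise ValueError(f"not a FASTQ header: {header!r}")
        yield ">" + header[1:]
        yield sequence


# Tiny in-memory example with two made-up reads.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
fasta_lines = list(fastq_to_fasta(example))
print("\n".join(fasta_lines))
```

The point of the conversion in answer 3 is that FASTA carries no quality values, so only the sequence itself is left for the trimmer to evaluate.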

Last edited by relipmoc; 01-14-2014 at 07:46 AM.
Old 01-14-2014, 12:00 PM   #4
BhariD
Member
 
Location: Georgia

Join Date: Mar 2013
Posts: 15
Default

Thank you for your prompt response!

I am sorry, I didn't quite understand "In the current implementation, adapter sequences longer than 64 nt are cut to 64 nt before processing". I don't think I have an adapter longer than 62 bp, so why is it looking for the last few nucleotides (3, I guess, here)?
Old 01-14-2014, 01:40 PM   #5
BhariD
Member
 
Location: Georgia

Join Date: Mar 2013
Posts: 15
Default skewer: A fast and sensitive adapter trimmer for paired-end reads

Also, what base quality threshold does the tool use to decide that a position counts as a mismatch? I am referring to "3) The program only provides an error ratio parameter (-r) and detects the most likely adapter location with a statistical scheme that takes the quality values into account".

Thanks!
Old 01-14-2014, 04:29 PM   #6
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

As I said, there's no need to specify the overlap length in paired-end mode. In fact, there is no parameter, and no default value, for the overlap length in paired-end mode.

The 64 nt statement is irrelevant to your question. I just misunderstood your question "any default parameter that I should be aware of". ^_^

Regarding "why is it looking for the last few nucleotides (3, I guess, here)": unfortunately, your guess is not correct. You got this result by chance.

Quote:
Originally Posted by BhariD View Post
Thank you for your prompt response!

I am sorry, I didn't quite understand "In the current implementation, adapter sequences longer than 64 nt are cut to 64 nt before processing". I don't think I have an adapter longer than 62 bp, so why is it looking for the last few nucleotides (3, I guess, here)?
Old 01-14-2014, 04:32 PM   #7
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

There's no base quality value threshold. That's all integrated into the statistical scheme. Since we have not published the paper, I can not tell you the details at the moment. Sorry for that!

Quote:
Originally Posted by BhariD View Post
Also, what base quality threshold does the tool use to decide that a position counts as a mismatch? I am referring to "3) The program only provides an error ratio parameter (-r) and detects the most likely adapter location with a statistical scheme that takes the quality values into account".

Thanks!

Last edited by relipmoc; 01-14-2014 at 04:37 PM.
Old 02-17-2014, 08:15 AM   #8
roryk
Member
 
Location: boston

Join Date: Aug 2010
Posts: 15
Default

Hi relipmoc,

I have a couple of questions:

1) How does skewer handle partial matches? For example if I have a sequence that goes SEQUENCE-ADAPTER-BARCODE, and I just input ADAPTER, will I end up with SEQUENCE?

2) Why is this sequence not being trimmed? Does skewer only match the entire adapter sequence?

@test_truseq/1
CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@test_truseq/2
CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

~/tmp/skewer-0.1.104-linux-x86_64 -x AGATCGGAAGAG -y AGATCGGAAGAG test_cutadapt_1.fastq test_cutadapt_2.fastq

Thanks! I've been looking around for a faster trimmer and was hoping skewer would be the solution.
Old 02-17-2014, 03:59 PM   #9
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Quote:
Originally Posted by relipmoc View Post
There's no base quality value threshold. That's all integrated into the statistical scheme. Since we have not published the paper, I can not tell you the details at the moment. Sorry for that!
That's not a good way to get people to use your software!
Old 02-18-2014, 06:38 AM   #10
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

Quote:
Originally Posted by roryk View Post
1) How does skewer handle partial matches? For example if I have a sequence that goes SEQUENCE-ADAPTER-BARCODE, and I just input ADAPTER, will I end up with SEQUENCE?
The answer is yes. However, if you want better specificity, you should use ADAPTER-BARCODE as the adapter sequence. Furthermore, if you want to demultiplex the reads, you can specify the --barcode option.

Quote:
Originally Posted by roryk View Post
2) ... Does skewer only match the entire adapter sequence?
The answer is no. skewer can detect a partially matched adapter sequence at the 3' end (or at the 5' end if '-e 5' is specified).
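To make "partially matched adapter at the 3' end" concrete, here is a deliberately naive sketch that only looks for exact matches between the end of a read and the start of the adapter. It is not skewer's algorithm (which was unpublished at the time of this thread and also handles errors, indels, and quality values); the function name `find_3prime_adapter` and the `min_overlap` parameter are hypothetical.

```python
def find_3prime_adapter(read: str, adapter: str, min_overlap: int = 1) -> int:
    """Return the index where the adapter starts in the read, or len(read) if absent.

    Only exact matches are considered: either the whole adapter occurs inside
    the read, or a suffix of the read equals a prefix of the adapter.
    """
    full = read.find(adapter)
    if full != -1:
        return full
    # Try the longest possible partial overlap first, then shorter ones.
    start = max(len(read) - len(adapter) + 1, 0)
    for i in range(start, len(read) - min_overlap + 1):
        if adapter.startswith(read[i:]):
            return i
    return len(read)


read = "CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA"  # test_truseq/1 from post #8
adapter = "AGATCGGAAGAG"
cut = find_3prime_adapter(read, adapter, min_overlap=5)
print(cut, read[:cut])
```

Here the last 9 bases of the read match the first 9 bases of the adapter, so an exact-match search would cut the read at position 31. A minimum overlap guards against trimming on very short chance matches.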

Quote:
Originally Posted by roryk View Post
2) Why is this sequence not being trimmed? ...
@test_truseq/1
CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@test_truseq/2
CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The sequences are not being trimmed because they are not what skewer expects. Are they from real data? If so, could you explain why the paired sequences preceding the adapter sequences are not reverse complementary to each other? Are they from mate-pair sequencing rather than paired-end sequencing?
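The reverse-complement point can be checked directly on the two test reads. In paired-end sequencing, when the insert is shorter than the read, the insert portion of read 2 is the reverse complement of the insert portion of read 1. The sketch below assumes the adapter begins where the shared trailing AGATCGGAA starts; the helper `reverse_complement` is written here for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(complement)[::-1]


read1 = "CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA"  # test_truseq/1
read2 = "CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA"  # test_truseq/2
adapter_part = "AGATCGGAA"  # first 9 nt of AGATCGGAAGAG, present at both 3' ends

if not (read1.endswith(adapter_part) and read2.endswith(adapter_part)):
    raise ValueError("expected both reads to end with the adapter prefix")

insert1 = read1[: -len(adapter_part)]
insert2 = read2[: -len(adapter_part)]

# For a read-through pair, read 2's insert should be the reverse complement of read 1's.
print(reverse_complement(insert1) == insert2)
# Instead, the two inserts are nearly the same sequence in the same orientation.
print(insert1[:29] == insert2[:29])
# What read 2 would look like if the pair were concordant:
print(reverse_complement(insert1) + adapter_part)
```

The two inserts share their first 29 bases in the same orientation, which is why the pair does not look like typical paired-end read-through data.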

Quote:
Originally Posted by roryk View Post
~/tmp/skewer-0.1.104-linux-x86_64 -x AGATCGGAAGAG -y AGATCGGAAGAG test_cutadapt_1.fastq test_cutadapt_2.fastq
There's no need to specify -y if pair1 and pair2 share the same adapter sequence.

Quote:
Originally Posted by roryk View Post
Thanks! I've been looking around for a faster trimmer and was hoping skewer would be the solution.
My pleasure! Hope it will make your work easier.
Old 02-18-2014, 06:58 AM   #11
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

Quote:
Originally Posted by frozenlyse View Post
That's not a good way to get people to use your software!
I apologize to those who want to know the technical details. However, you won't have to wait too long. I'll inform you once our submission is accepted. Thanks!
Old 02-28-2014, 10:16 AM   #12
hartmaier
Member
 
Location: Pittsburgh

Join Date: Dec 2012
Posts: 12
Default

Quote:
Originally Posted by relipmoc View Post
I apologize to those who want to know the technical details. However, you won't have to wait too long. I'll inform you once our submission is accepted. Thanks!
Any chance you can release OSX binaries?
Old 03-31-2014, 01:24 AM   #13
kidaaaa
Junior Member
 
Location: China

Join Date: Oct 2011
Posts: 2
Default

I have a question:

How does skewer handle the situation where only one read of a pair survives and the other does not? I didn't see singletons in the output files.
Old 03-31-2014, 06:13 AM   #15
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

Quote:
Originally Posted by kidaaaa View Post
I have a question:

How does skewer handle the situation where only one read of a pair survives and the other does not? I didn't see singletons in the output files.
Thank you for your question! Currently, skewer only outputs pairs that are concordantly trimmed or untouched.
Old 03-31-2014, 02:18 PM   #16
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 438
Default

Hi relipmoc,

I just found the answer in another thread:
http://seqanswers.com/forums/showthread.php?t=41976

My use case should be no problem, I guess.

(Can skewer trim multiple potential contaminant sequences in the same run? It seems to be focused on adapter pairs, similar to Trimmomatic's palindrome trimming mode.)

Last edited by luc; 03-31-2014 at 02:23 PM.
Old 06-02-2014, 07:17 PM   #17
roryk
Member
 
Location: boston

Join Date: Aug 2010
Posts: 15
Default

Thanks relipmoc, with 0.1.114 skewer correctly trims the test sequences I posted. Thanks!
Old 06-03-2014, 02:14 PM   #18
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 438
Default

Hi relipmoc,

it seems to me several sequences in the reverse reads are escaping removal when using the length threshold on paired reads. The filtering for length works fine for the forward reads.

Does the "paired information aware" trimming option work when providing a single (-x) adapter file containing several adapter sequences?
Old 06-04-2014, 05:32 PM   #19
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

Hi Luc,
Thank you for your question! My answer goes as follows:

Quote:
Originally Posted by luc View Post
it seems to me several sequences in the reverse reads are escaping removal when using the length threshold on paired reads. The filtering for length works fine for the forward reads.
The length threshold -k does not influence the trimming result for paired-end data. Other length thresholds, such as -l and -L, do influence the results. Could you explain your case in more detail?

Quote:
Originally Posted by luc View Post
Does the "paired information aware" trimming option work when providing a single (-x) adapter file containing several adapter sequences?
The answer is yes. However, you need to pay special attention to trimming efficiency. In your case skewer tries n * n adapter combinations during trimming, where n is the number of adapter sequences in the adapter file. If the adapter sequences share most of their content and differ only in some region, e.g. a 6-bp index region, you can use degenerate characters in this region and specify a single representative adapter sequence. For instance, if the content of the adapter file is:
Code:
>Index 1, ATCACG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>Index 2, CGATGT
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>Index 3, TTAGGC
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>Index 4, TGACCA
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>Index 5, ACAGTG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>Index 6, GCCAAT
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
you may specify -x TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG to improve trimming efficiency.
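The collapse from six indexed adapters to one degenerate adapter can be reproduced with a small column-wise consensus. This is only an illustrative sketch: the function name `degenerate_consensus` is hypothetical, and it assumes all adapters have the same length and differ only by substitutions.

```python
from typing import Iterable


def degenerate_consensus(adapters: Iterable[str]) -> str:
    """Collapse equal-length adapters into one sequence, writing N where they differ."""
    seqs = list(adapters)
    if not seqs:
        raise ValueError("no adapter sequences given")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("adapters must all have the same length")
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in zip(*seqs))


# The six adapters from the example file, rebuilt from their shared parts.
prefix = "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC"
suffix = "ATCTCGTATGCCGTCTTCTGCTTG"
indexes = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA", "ACAGTG", "GCCAAT"]
adapters = [prefix + index + suffix for index in indexes]

consensus = degenerate_consensus(adapters)
print(consensus)
# With n adapters in one file, n * n combinations are tried; one adapter needs only one.
print(len(adapters) ** 2, "combinations versus 1")
```

For these six sequences every position of the 6-bp index differs between at least two adapters, so the consensus carries N across the whole index and matches the -x value given above.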
Old 06-04-2014, 05:35 PM   #20
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Smile

Quote:
Originally Posted by roryk View Post
Thanks relipmoc, with 0.1.114 skewer correctly trims the test sequences I posted. Thanks!
Hi roryk, thank you for your feedback!