#1 |
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
Hello everyone,
We've implemented a new tool named skewer for adapter trimming. At the moment it is aimed at preprocessing Illumina paired-end and single-end reads. The main features are as follows:
* Full-length adapter sequence trimming, for higher specificity.
* Indel errors are allowed when searching for the adapter sequence.
* Very fast. Internally it uses a novel local alignment algorithm that has not been published yet. In single-thread mode it can process a pair of compressed files, whose uncompressed sizes were about 12 GB each, in about 30 minutes. It is even faster in multi-thread mode, but the speedup is limited because only the sequence alignment part is parallelized.
* Quality values aware. It evaluates alignments based on sequence qualities.
* Paired information aware. It is more accurate when processing paired-end reads.
If you are interested in using it, please download it from https://sourceforge.net/projects/skewer/
Any feedback or feature requests are welcome!
Last edited by relipmoc; 09-23-2013 at 05:53 PM. Reason: :) |
#2 |
Member
Location: Georgia Join Date: Mar 2013
Posts: 15
|
Hi,
I have to trim full-length adapter sequences with zero mismatches. I do not want to trim reads on any other criteria at this point. I am using the following command line:
./skewer-0.1.99-linux-x86_64 -x ACACTCTTTCCCTACACGACGCTCTTCCGATCT -y GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -r 0 -d 0 -o exact_trim_15 -t 8 read_1.fastq paired_read2.fastq
The log file includes:
Parameters used:
-- 3' end adapter sequence (-x): ACACTCTTTCCCTACACGACGCTCTTCCGATCT
-- paired 3' end adapter sequence (-y): GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
-- maximum error ratio allowed (-r): 0.000
-- maximum indel error ratio allowed (-d): 0.000
-- minimum read length allowed after trimming (-l): 18
-- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
-- number of concurrent threads (-t): 8
Tue Jan 14 02:18:14 2014 >> started
Tue Jan 14 02:19:33 2014 >> done (78.699s)
47656840 read pairs processed; of these:
0 ( 0.00%) short read pairs filtered out after trimming by size control
0 ( 0.00%) empty read pairs filtered out after trimming by size control
47656840 (100.00%) read pairs available; of these:
3202 ( 0.01%) trimmed read pairs available after processing
47653638 (99.99%) untrimmed read pairs available after processing
Length distribution of reads after trimming:
length count percentage
97 1 0.00%
98 4 0.00%
99 3197 0.01%
100 47653638 99.99%
My questions are:
1) Given the input parameter settings, were the 3197 trimmed read pairs really trimmed based only on an exact full-length adapter sequence match? Is there any default parameter that I should be aware of?
I would appreciate your help! Thank you! |
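As a quick sanity check of the log above, the length distribution can be tallied to see how many bases were removed from each trimmed pair. This is only a sketch of the arithmetic: the counts are copied from the log, and the 100 nt input read length is an assumption inferred from the fact that the untrimmed count equals the count at length 100.

```python
# Length distribution copied from the log: read length -> number of read pairs.
length_counts = {97: 1, 98: 4, 99: 3197, 100: 47653638}

# Assumption: untrimmed reads are 100 nt long (the "untrimmed" count in the
# log is identical to the count reported at length 100).
FULL_LENGTH = 100


def trimmed_summary(counts, full_length=FULL_LENGTH):
    """Return {bases_removed: pair_count} for every length below full_length."""
    return {
        full_length - length: n
        for length, n in counts.items()
        if length < full_length
    }


summary = trimmed_summary(length_counts)
total_pairs = sum(length_counts.values())
trimmed_pairs = sum(summary.values())

for removed, n in sorted(summary.items()):
    print(f"{removed} nt removed: {n} pairs")
print(f"trimmed pairs: {trimmed_pairs} of {total_pairs}")
```

The tally agrees with the totals in the log (3202 trimmed pairs out of 47656840), and shows that almost all trimmed pairs lost a single base.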
#3 |
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
Thank you so much for your feedback!
Quick answers to your questions:
1) The search is based on the exact full-length adapter sequence, but for the 3197 read pairs only the last nucleotides of the reads were identified as the first nucleotides of the corresponding adapter sequences. In the current implementation, an adapter sequence longer than 64 nt is cut to 64 nt before processing.
2) There's no need to specify the overlap length in paired-end mode. The program knows how to do it correctly.
3) The program only provides a parameter of error ratio (by -r) and detects the most probable adapter location by a statistical scheme which takes into account the quality values. If you just want to specify the maximum number of allowed mismatches in the full-length adapter sequence, you can use fq2fa.sh to convert the FASTQ files to FASTA files (which carry no quality values), and specify the maximum allowed error ratio (-r) as 2/33=0.06 (for example, 2 mismatches in a 33 nt adapter). For small RNA adapter trimming, the command looks something like this:
$ fq2fa.sh srnaReads.fq | skewer -x TCGTATGCCGTCTTCTGCTTGAAAAAAA -L 30 -r 0.06 -o trimmed -
4) For multiple adapter sequences, you just need to specify two FASTA files that contain the adapter sequences, and run something like:
$ skewer -x adapters1.fa -y adapters2.fa flowcell1_lane7_pair1.fastq.gz flowcell1_lane7_pair2.fastq.gz
Last edited by relipmoc; 01-14-2014 at 08:46 AM. |
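The two steps in answer 3 can be sketched in Python. This is a minimal illustration, not part of skewer: `max_error_ratio` and `fastq_to_fasta` are hypothetical helper names, and `fastq_to_fasta` is only an assumed stand-in for what fq2fa.sh does (it expects plain, unwrapped 4-line FASTQ records).

```python
import io


def max_error_ratio(max_mismatches, adapter_length):
    """Error ratio corresponding to a mismatch budget over the full adapter."""
    return max_mismatches / adapter_length


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy 4-line FASTQ records to FASTA, dropping the quality lines.

    Hypothetical stand-in for fq2fa.sh. Returns the number of records written.
    """
    n = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break                      # end of input
        if not header.strip():
            continue                   # skip stray blank lines
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()        # '+' separator line
        fastq_handle.readline()        # quality line, discarded
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        n += 1
    return n


# -x adapter from the command line in post #2 (33 nt).
adapter = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
ratio = max_error_ratio(2, len(adapter))
print(f"adapter length {len(adapter)}, error ratio {ratio:.4f}")

# Tiny in-memory demonstration of the conversion.
demo_in = io.StringIO("@r1\nACGT\n+\nIIII\n")
demo_out = io.StringIO()
records = fastq_to_fasta(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

How skewer rounds the ratio internally is not stated in the thread, so the value passed to -r should be checked against the trimming results.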
#4 |
Member
Location: Georgia Join Date: Mar 2013
Posts: 15
|
Thank you for your prompt response!
I am sorry, I couldn't quite follow the statement "In current implementation, adapter sequence longer than 64 nt will be cut to 64 nt before processing". I don't think I have an adapter longer than 62 bp, so why is it looking for the last few nucleotides (3, I guess, here)? |
#5 |
Member
Location: Georgia Join Date: Mar 2013
Posts: 15
|
Also, what base quality value threshold does the tool use to decide that a base counts as a mismatch? I am referring to "3) The program only provides a parameter of error ratio (by -r) and detects the most probable adapter location by a statistical scheme which takes into account the quality values".
Thanks! |
#6 | |
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
As I said, "there's no need to specify the overlap length in paired-end mode". In fact, there is no parameter, default or otherwise, for the overlap length in paired-end mode.
The 64 nt statement is irrelevant to your question; I just misunderstood your question "any default parameter that I should be aware of". ^_^ As for "why its looking for last few nucleotides (3 I guess here?)": unfortunately, your guess is not correct. It's by chance that you got this result. Quote:
|
|
#7 | |
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
There's no base quality value threshold. That's all integrated into the statistical scheme. Since we have not published the paper yet, I cannot share the details at the moment. Sorry about that!
Quote:
Last edited by relipmoc; 01-14-2014 at 05:37 PM. |
|
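For background on what the quality values represent: the log in post #2 reports the Sanger/Illumina 1.8+ FASTQ format, which uses the standard Phred+33 encoding. The sketch below shows only that standard decoding from a quality character to an error probability. It says nothing about skewer's statistical scheme, which is unpublished; the function names are illustrative.

```python
def phred33_to_error_prob(qual_char):
    """Convert one Phred+33 quality character to a base-call error probability."""
    q = ord(qual_char) - 33            # Phred score Q
    if q < 0:
        raise ValueError(f"not a valid Phred+33 character: {qual_char!r}")
    return 10 ** (-q / 10)             # P(error) = 10^(-Q/10)


def error_probs(quality_line):
    """Decode a whole FASTQ quality line."""
    return [phred33_to_error_prob(c) for c in quality_line]


# 'I' is the quality character used in the test reads posted later in this thread.
print("Phred score of 'I':", ord("I") - 33)
print("error probability:", phred33_to_error_prob("I"))
```

A scheme that weights mismatches by such probabilities has no single cutoff, which is consistent with the statement that there is no quality threshold.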
#8 |
Member
Location: boston Join Date: Aug 2010
Posts: 15
|
Hi relipmoc,
I have a couple of questions:
1) How does skewer handle partial matches? For example, if I have a sequence that goes SEQUENCE-ADAPTER-BARCODE, and I just input ADAPTER, will I end up with SEQUENCE?
2) Why is this sequence not being trimmed? Does skewer only match the entire adapter sequence?
@test_truseq/1
CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@test_truseq/2
CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
~/tmp/skewer-0.1.104-linux-x86_64 -x AGATCGGAAGAG -y AGATCGGAAGAG test_cutadapt_1.fastq test_cutadapt_2.fastq
Thanks! I've been looking around for a faster trimmer and was hoping skewer would be the solution. |
#9 |
Senior Member
Location: Australia Join Date: Sep 2008
Posts: 136
|
#10 | |||
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
Quote:
The answer is no. skewer can detect a partially matched adapter sequence at the 3' end (or at the 5' end if '-e 5' is specified). Quote:
Quote:
My pleasure! Hope it will make your work easier. |
|||
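To illustrate what detecting a partially matched adapter at the 3' end means, here is a minimal sketch applied to the two test reads from post #8. It is not skewer's algorithm (which is unpublished and also handles errors, indels, qualities and pairing): it uses exact matching only, and `find_3prime_adapter` is a hypothetical helper name.

```python
def find_3prime_adapter(read, adapter, min_overlap=1):
    """Return the index where the adapter starts in `read`, or len(read) if absent.

    Exact matching only: first look for the full adapter anywhere in the read
    (the SEQUENCE-ADAPTER-BARCODE case), then for the longest read suffix that
    equals an adapter prefix (adapter running off the 3' end).
    """
    pos = read.find(adapter)
    if pos != -1:
        return pos
    longest = min(len(read), len(adapter))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return len(read) - overlap
    return len(read)


adapter = "AGATCGGAAGAG"
reads = {
    "test_truseq/1": "CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA",
    "test_truseq/2": "CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA",
}

for name, seq in reads.items():
    cut = find_3prime_adapter(seq, adapter)
    print(name, "keep", seq[:cut], "remove", seq[cut:])
```

With `min_overlap=1`, a single trailing base that happens to equal the first adapter base is also removed, so a larger `min_overlap` trades sensitivity for fewer chance matches.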
#11 |
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
#12 |
Member
Location: Pittsburgh Join Date: Dec 2012
Posts: 12
|
#13 |
Junior Member
Location: China Join Date: Oct 2011
Posts: 2
|
I have a question:
How does skewer handle the situation where only one read of a pair survives and the other one does not? I didn't see singletons in the output files. |
#15 |
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
Thank you for your question! Currently skewer only outputs those pairs that are concordantly trimmed or left untouched.
|
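One way to picture output without singletons is pair-level filtering: a pair is either kept with both mates or dropped entirely. The sketch below is an assumption about the general idea, not a description of skewer's internals. `filter_pairs` is a hypothetical name, and the default minimum length of 18 is taken from the -l value shown in the log in post #2.

```python
def filter_pairs(pairs, min_length=18):
    """Keep a pair only if both trimmed mates reach min_length.

    Returns (kept_pairs, dropped_count). A pair with one short mate is
    dropped as a whole, so no singleton is ever emitted.
    """
    kept = []
    dropped = 0
    for mate1, mate2 in pairs:
        if len(mate1) >= min_length and len(mate2) >= min_length:
            kept.append((mate1, mate2))
        else:
            dropped += 1
    return kept, dropped


# Toy trimmed pairs: the second pair has a mate shorter than 18 nt.
pairs = [
    ("A" * 40, "C" * 40),
    ("A" * 40, "C" * 10),
    ("A" * 18, "C" * 18),
]
kept, dropped = filter_pairs(pairs)
print(f"kept {len(kept)} pairs, dropped {dropped}")
```

Keeping the two output files in lockstep this way preserves read order, which downstream paired-end aligners generally rely on.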
#16 |
Senior Member
Location: US Join Date: Dec 2010
Posts: 453
|
Hi relipmoc,
I just found the answer in another thread: http://seqanswers.com/forums/showthread.php?t=41976 My use case should be no problem, I guess. (Can skewer trim multiple potential contaminant sequences in the same run? It seems to be focused on adapter pairs, similar to Trimmomatic's palindrome trimming mode?)
Last edited by luc; 03-31-2014 at 03:23 PM. |
#17 |
Member
Location: boston Join Date: Aug 2010
Posts: 15
|
Thanks relipmoc, with 0.1.114 skewer trims off the test sequences I posted correctly. Thanks!
|
#18 |
Senior Member
Location: US Join Date: Dec 2010
Posts: 453
|
Hi relipmoc,
It seems to me that several sequences in the reverse reads are escaping removal when using the length threshold on paired reads. The length filtering works fine for the forward reads. Does the "paired information aware" trimming option work when providing a single (-x) adapter file containing several adapter sequences? |
#19 | ||
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|
Hi Luc,
Thank you for your question! My answer goes as follows: Quote:
Quote:
Code:
>Index 1, ATCACG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>Index 2, CGATGT
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>Index 3, TTAGGC
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>Index 4, TGACCA
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>Index 5, ACAGTG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>Index 6, GCCAAT
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
|
||
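The six adapters in the FASTA listing above differ only in the 6 nt index, so a multi-adapter file of this shape can be generated rather than typed by hand. The sketch below is a minimal illustration: the constant flanks were read off the sequences in the post, and `adapter_fasta` is a hypothetical helper name.

```python
# Constant flanks shared by all six adapters in the listing (read off the post).
PREFIX = "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC"
SUFFIX = "ATCTCGTATGCCGTCTTCTGCTTG"
INDEXES = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA", "ACAGTG", "GCCAAT"]


def adapter_fasta(indexes, prefix=PREFIX, suffix=SUFFIX):
    """Build FASTA text with one record per index: prefix + index + suffix."""
    lines = []
    for i, index in enumerate(indexes, start=1):
        lines.append(f">Index {i}, {index}")
        lines.append(prefix + index + suffix)
    return "\n".join(lines) + "\n"


fasta = adapter_fasta(INDEXES)
print(fasta, end="")
```

The resulting text can be written to a file and passed to -x as described in answer 4 of post #3. Each generated adapter is 63 nt long, which is below the 64 nt limit mentioned in that post.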
#20 |
Member
Location: Los Angeles, CA Join Date: Jul 2011
Posts: 58
|