**luc** · 03-31-2014, 02:18 PM

Hi relipmoc,

I just found the answer in another thread:

http://seqanswers.com/forums/showthread.php?t=41976

My use case should be no problem, I guess.

(Can skewer trim multiple potential contaminant sequences in the same run? It seems to be focused on adapter pairs, similar to Trimmomatic's palindrome trimming mode.)

**roryk** · 06-02-2014, 07:17 PM

Thanks relipmoc, with 0.1.114 skewer correctly trims off the test sequences I posted. Thanks!

**luc** · 06-03-2014, 02:14 PM

Hi relipmoc,

it seems to me several sequences in the reverse reads are escaping removal when using the length threshold on paired reads. The filtering for length works fine for the forward reads.

Does the "paired information aware" trimming option work when providing a single (-x) adapter file containing several adapter sequences?

**relipmoc** · 06-04-2014, 05:32 PM

Hi Luc,
Thank you for your question! My answer goes as follows:

Originally posted by luc

it seems to me several sequences in the reverse reads are escaping removal when using the length threshold on paired reads. The filtering for length works fine for the forward reads.

The length threshold -k does not influence the trimming result of paired-end data. Other length thresholds such as -l and -L do influence the results. Could you explain your case in more detail?

Originally posted by luc

Does the "paired information aware" trimming option work when providing a single (-x) adapter file containing several adapter sequences?

The answer is YES. However, you need to pay special attention to trimming efficiency. In your case skewer tries n * n adapter combinations during adapter trimming, where n is the number of adapter sequences provided in the adapter file. If the adapter sequences share most of their content but differ in one region, e.g. a 6-bp index region, you can use degenerate characters in that region and specify a single representative adapter sequence. For instance, if the content of the adapter file is:

Code:

>Index 1, ATCACG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>Index 2, CGATGT
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>Index 3, TTAGGC
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>Index 4, TGACCA
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>Index 5, ACAGTG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>Index 6, GCCAAT
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG

you may specify -x TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG to improve trimming efficiency.
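The collapsing step described above can be sketched in a few lines of Python. This is only an illustration, not part of skewer: the helper name `collapse_adapters` is hypothetical, and it assumes all adapters have the same length and differ only by substitutions. It puts an `N` in every column where the adapters disagree.

```python
def collapse_adapters(adapters):
    """Return one sequence with 'N' in every column where the inputs disagree."""
    lengths = {len(adapter) for adapter in adapters}
    if len(lengths) != 1:
        raise ValueError("adapters must all have the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else "N"
        for column in zip(*adapters)
    )


# Shared flanks and the six index sequences from the adapter file above.
PREFIX = "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC"
SUFFIX = "ATCTCGTATGCCGTCTTCTGCTTG"
INDEXES = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA", "ACAGTG", "GCCAAT"]

adapters = [PREFIX + index + SUFFIX for index in INDEXES]
representative = collapse_adapters(adapters)

print(representative)
# With n adapters in the file, n * n combinations are tried; one
# representative sequence reduces that to a single combination.
print(len(adapters) ** 2, "combinations versus 1")
```

The resulting string is what would be passed to -x in place of the six-entry adapter file.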

**relipmoc** · 06-04-2014, 05:35 PM

Originally posted by roryk

Thanks relipmoc, with 0.1.114 skewer correctly trims off the test sequences I posted. Thanks!

Hi roryk, thank you for your feedback!

**luc** · 06-04-2014, 10:09 PM

Hi Relipmoc,

thanks for the tip with the barcoded adapters. A very nice feature.

I got strange results when trimming paired-end data using the parameter "-l 20".
All the read pairs containing forward reads shorter than 20 bases were indeed filtered out, but not all of the read pairs containing reverse reads shorter than 20 bases.

Btw, does skewer search for the reverse complements of the adapters by default (likely not in the paired mode)?

**relipmoc** · 06-05-2014, 07:01 AM

Originally posted by luc

I got strange results when trimming paired-end data using the parameter "-l 20".
All the read pairs containing forward reads shorter than 20 bases were indeed filtered out, but not all of the read pairs containing reverse reads shorter than 20 bases.

Could you show us the problematic PE reads in FASTQ format, so that I can figure out what's wrong with the program?
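One way to pull out such read pairs is sketched below. It is not a skewer feature, just a small Python helper; it assumes plain 4-line FASTQ records with mates in the same order in both files, and the file names in the usage comment are hypothetical.

```python
from itertools import islice


def fastq_records(lines):
    """Yield (header, sequence, separator, quality) tuples from 4-line FASTQ."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def pairs_with_short_mate(r1_lines, r2_lines, min_len):
    """Yield record pairs in which either mate is shorter than min_len."""
    for rec1, rec2 in zip(fastq_records(r1_lines), fastq_records(r2_lines)):
        if len(rec1[1]) < min_len or len(rec2[1]) < min_len:
            yield rec1, rec2


# Hypothetical file names; replace them with the files written by your run.
# with open("trimmed-pair1.fastq") as r1, open("trimmed-pair2.fastq") as r2:
#     for rec1, rec2 in pairs_with_short_mate(r1, r2, min_len=20):
#         print("\n".join(rec1 + rec2))
```

Looking up the reported read names in the untrimmed input then gives the original pairs to post.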

Originally posted by luc

Btw, does skewer search for the reverse complements of the adapters by default (likely not in the paired mode)?

The answer is NO.

**blsfoxfox** · 06-06-2014, 08:35 AM

trimmed reads longer than the length!

Hi Relipmoc,

Thank you for this software. I ran into a problem and may need your help.

I am working with HiSeq 2500 data from a Nextera Mate Pair library, and the following parameters were used:

skewer-0.1.114-linux-x86_64 -x GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -y GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -j CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG -m mp -k 9 -f sanger -l 30 -L 150 -o skewer_library1_2 1.fastq 2.fastq

-- 3' end adapter sequence (-x): GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-- paired 3' end adapter sequence (-y): GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
-- junction adapter sequence (-j): CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
-- maximum error ratio allowed (-r): 0.100
-- maximum indel error ratio allowed (-d): 0.030
-- minimum read length allowed after trimming (-l): 30
-- maximum read length for output (-L): 150
-- file format (-f): Sanger/Illumina 1.8+ FASTQ
-- minimum overlap length for junction adapter detection (-k): 9
Wed Jun 4 15:28:27 2014 >> started

Thu Jun 5 10:40:33 2014 >> done (69126.658s)
208936993 read pairs processed; of these:
93035 ( 0.04%) non-junction read pairs filtered out by contaminant control
29290940 (14.02%) short read pairs filtered out after trimming by size control
6182785 ( 2.96%) empty read pairs filtered out after trimming by size control
173370233 (82.98%) read pairs available; of these:
94951230 (54.77%) trimmed read pairs available after processing
78419003 (45.23%) untrimmed read pairs available after processing

The length distribution of reads after trimming reported by skewer shows a maximum read length of 150 bp.

However, when I checked the result with FastQC, I found many reads longer than 150 bp (please see the attachment). I also found those "long" reads by eyeballing the result file.

Have you ever seen something like this? What do you think the reason is?

P.S. I have tried this with and without -L 150, and there are longer reads in both cases.

Thanks,

Attached Files

skewer fastqc.jpg (94.8 KB)

**relipmoc** · 06-08-2014, 09:51 AM

Hi blsfoxfox,

Thank you very much for your feedback! The name of the parameter is misleading: its actual meaning is the maximum equivalent read length. For example, if trimmed read 1 is 224 bases long and trimmed read 2 is 40 bases long, the equivalent read length is int((224 + 40) / 2) = 132. Therefore "-L 150" cannot filter out this read pair, but "-L 120" can.

For your case, you can try "-L 75", but I guess this is not what you want. We may upgrade skewer to add another parameter for clipping bases after a specified length.
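The arithmetic above can be written out as a short Python sketch. The function names are hypothetical, and the strict "greater than" comparison is an assumption inferred from the example (132 survives "-L 150" but not "-L 120"); skewer's exact boundary behaviour is not stated in this thread.

```python
def equivalent_read_length(len1, len2):
    """Integer average of the two trimmed read lengths: int((len1 + len2) / 2)."""
    return (len1 + len2) // 2


def filtered_by_max_length(len1, len2, max_len):
    """True if the pair would be removed by -L max_len.

    Assumption: a pair is removed when its equivalent length exceeds max_len.
    """
    return equivalent_read_length(len1, len2) > max_len


pair = (224, 40)
print(equivalent_read_length(*pair))
for max_len in (150, 120):
    print("-L", max_len, "filters pair:", filtered_by_max_length(*pair, max_len))
```

The sketch makes the consequence visible: an individual read of 224 bases passes "-L 150" because only the pair average is compared with the threshold.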

**blsfoxfox** · 06-11-2014, 09:36 PM

Thank you for the response! You're right, I would like to clip bases in each read file.

Actually, I am more curious about why skewer would produce trimmed reads longer than the original ones. If we understood that, we could avoid getting the long reads and would not need another parameter to deal with them.

By the way, skewer is really fast

**relipmoc** · 06-13-2014, 08:47 AM

Originally posted by blsfoxfox

Thank you for the response! You're right, I would like to clip bases in each read file.

We will add a parameter for clipping bases in the future versions.

Originally posted by blsfoxfox

Actually, I am more curious about why skewer would produce trimmed reads longer than the original ones. If we understood that, we could avoid getting the long reads and would not need another parameter to deal with them.

Good question! For Nextera long mate-pair (LMP) reads, skewer first treats them as normal paired-end (PE) reads and trims adapters from them. The trimmed reads correspond to fragments that were originally shorter than the read length. If no junction adapter is found within such a pair, the trimmed read pair is marked as a non-junction read pair and removed as a contaminant.

The non-trimmed reads, in contrast, correspond to fragments that were originally equal to or longer than the read length. These read pairs fall into three classes: 1) junction adapters are found in the middle of both reads of the pair; 2) a junction adapter is found in the middle of one read of the pair; 3) no junction adapter is found in either read of the pair. For class 1), skewer simply trims the junction adapters as in the single-end (SE) case. For class 2), suppose without loss of generality that read 1 contains the junction adapter while read 2 does not. Skewer searches for the best overlap between the 3' end of read 1 and the 5' end of the reverse complement of read 2. If the overlap lies after the junction adapter region of read 1, the subsequence after the junction adapter region of read 1 is reverse-complemented and appended to read 2. This is why some reads are longer than the original read length after adapter trimming.
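The class 2) handling can be illustrated with a simplified Python sketch. It is not skewer's implementation: it uses exact string matching instead of the error-tolerant scoring scheme, the function names and the `min_overlap` value are hypothetical, and it assumes that only the part of read 1 between the junction adapter and the overlap is new sequence for read 2.

```python
JUNCTION = "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"  # -j value from the log above

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_overlap(left, right, min_len):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def rescue_class2(read1, read2, junction=JUNCTION, min_overlap=8):
    """Trim read 1 at the junction adapter and extend read 2 if they overlap."""
    start = read1.find(junction)
    if start < 0:
        return read1, read2  # no junction adapter in read 1: not this case
    after = read1[start + len(junction):]
    trimmed1 = read1[:start]
    k = longest_overlap(read1, revcomp(read2), min_overlap)
    if k == 0 or k > len(after):
        # No overlap, or the overlap reaches into the junction adapter region.
        return trimmed1, read2
    extra = after[:len(after) - k]  # bases of read 1 not already in read 2
    return trimmed1, read2 + revcomp(extra)


# Toy example: read 1 runs through the junction adapter into sequence
# whose end overlaps the start of the reverse complement of read 2.
read1 = "ACGTACGTAC" + JUNCTION + "TTGCA" + "CCGGTAAGCT"
read2 = revcomp("CCGGTAAGCT" + "GATTACAGAT")
new1, new2 = rescue_class2(read1, read2)
print(len(read2), "->", len(new2))
```

In the toy example read 2 grows by the bases that lie between the junction adapter and the overlap in read 1, which is the effect that produces reads longer than the sequencing read length.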

Originally posted by blsfoxfox

By the way, skewer is really fast

Thank you for the praise!

**relipmoc** · 06-13-2014, 08:56 AM

skewer has been accepted as a methodology paper in BMC Bioinformatics

If you find skewer useful for your study, please cite it in your paper. Thank you!

BMC Bioinformatics.2014, 15:182
DOI: 10.1186/1471-2105-15-182
URL: http://www.biomedcentral.com/1471-2105/15/182

**ug14cxb** · 07-14-2014, 07:40 AM

The source code is here: https://github.com/relipmoc/skewer
I would have thought it would be on SourceForge, but GitHub is way better.
Thanks for sharing this.

**MikhailFokin** · 07-30-2014, 08:25 PM

Thank you very much for this explanation! It is somewhat strange that SEQanswers is the only place where this is explained.

I still have a few questions about how skewer processes Nextera libraries.

1. For the 1st case, does "SE trimming" mean removing the junction adapter and the following sequence up to the 5' end as well?
2. For the second case (adapter in one read only):
(A) What defines "the best overlap": length? mismatches?
(B) What does skewer do if there is no overlap between the reads?
3. How can I switch off trimming of external adapters?
4. In the analysis below, it is not clear what "549499 (24.26%) untrimmed read pairs available after processing" means. How can any untrimmed reads be present in the result, rather than being counted under "5968 ( 0.20%) non-junction read pairs filtered out by contaminant control"?

skewer -m mp -t 16 -k 30 -l 40 -b S4-R1.fastq S4-R2.fastq

Parameters used:
-- 3' end adapter sequence (-x): AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
-- junction adapter sequence (-j): CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
-- maximum error ratio allowed (-r): 0.100
-- maximum indel error ratio allowed (-d): 0.030
-- minimum read length allowed after trimming (-l): 40
-- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
-- minimum overlap length for junction adapter detection (-k): 30
-- number of concurrent threads (-t): 16

3016744 read pairs processed; of these:
5968 ( 0.20%) non-junction read pairs filtered out by contaminant control
725620 (24.05%) short read pairs filtered out after trimming by size control
20306 ( 0.67%) empty read pairs filtered out after trimming by size control
2264850 (75.08%) read pairs available; of these:
1715351 (75.74%) trimmed read pairs available after processing
549499 (24.26%) untrimmed read pairs available after processing

Barcode dispatch after trimming:
category count percentage:
X01Y01 1422074 82.90%

Thank you...

**relipmoc** · 08-04-2014, 10:17 PM

Originally posted by MikhailFokin

1. For the 1st case, does "SE trimming" mean removing the junction adapter and the following sequence up to the 5' end as well?

"SE trimming" means removing the junction adapter and the sequence that follows it, up to the 3' end.

Originally posted by MikhailFokin

2. For the second case (adapter in one read only):
(A) What defines "the best overlap": length? mismatches?
(B) What does skewer do if there is no overlap between the reads?

(A) There may be several candidate overlap sites; the best overlap is selected according to the scoring scheme presented in the paper. The threshold for overlap detection is proportional to the -r threshold specified by the user.
(B) No additional action is taken in this case.

Originally posted by MikhailFokin

3. How can I switch off trimming of external adapters?

Do you mean trimming the external adapters only? For research purposes, you may use PE mode instead of MP mode, but this is not recommended.

Originally posted by MikhailFokin

4. In my analysis, it is not clear what "549499 (24.26%) untrimmed read pairs available after processing" means. How can any untrimmed reads be present in the result, rather than being counted under "5968 ( 0.20%) non-junction read pairs filtered out by contaminant control"?

These are read pairs of the 3rd class, which is different from the non-junction read pairs. For the 3rd class, we cannot declare that there is no junction adapter in the fragment. For the non-junction read pairs, however, the fragment is shorter than the read length, so we can confidently declare that they do not contain junction adapters.
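The distinction can be summarised as a small decision function. This is a reading of the explanations in this thread written as a Python sketch, not skewer's code: the function name, the boolean inputs, and the returned labels are hypothetical, and the fate of short fragments that do contain a junction adapter is an assumption, since the thread only says what happens when none is found.

```python
def classify_mate_pair(external_adapter_trimmed, junction_in_read1, junction_in_read2):
    """Describe how a mate pair is treated, following the thread's explanation.

    external_adapter_trimmed: True if PE-style trimming found the external
    adapters, i.e. the fragment is shorter than the read length.
    """
    junction_hits = int(junction_in_read1) + int(junction_in_read2)
    if external_adapter_trimmed:
        if junction_hits == 0:
            # The whole fragment was read, so a junction adapter cannot be hidden.
            return "non-junction pair: filtered out by contaminant control"
        return "short fragment with junction adapter: kept after trimming (assumed)"
    if junction_hits == 2:
        return "class 1: junction adapters trimmed from both reads"
    if junction_hits == 1:
        return "class 2: junction adapter trimmed, other read may be extended"
    # The unsequenced middle of the fragment may still hold a junction adapter.
    return "class 3: kept as an untrimmed read pair"


print(classify_mate_pair(True, False, False))
print(classify_mate_pair(False, False, False))
```

The first call corresponds to the "non-junction read pairs filtered out by contaminant control" line of the report, the second to the "untrimmed read pairs available after processing" line.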
