
  • #16
    Hi replimoc,

    I just found the answer in another thread:
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    My use case should be no problem, I guess.

    (Can skewer trim multiple potential contaminant sequences in the same run? It seems to be focused on adapter pairs, similar to Trimmomatic's palindrome trimming mode?)
    Last edited by luc; 03-31-2014, 02:23 PM.



    • #17
      Thanks replimoc, with 0.1.114 skewer correctly trims off the test sequences I posted. Thanks!



      • #18
        Hi replimoc,

        it seems to me several sequences in the reverse reads are escaping removal when using the length threshold on paired reads. The filtering for length works fine for the forward reads.

        Does the "paired information aware" trimming option work when providing a single (-x) adapter file containing several adapter sequences?



        • #19
          Hi Luc,
          Thank you for your question! My answer goes as follows:

          Originally posted by luc View Post
          it seems to me several sequences in the reverse reads are escaping removal when using the length threshold on paired reads. The filtering for length works fine for the forward reads.
          The length threshold -k does not influence the trimming result of paired-end data. Other length thresholds, such as -l and -L, do influence the results. Could you explain your case in more detail?

          Originally posted by luc View Post
          Does the "paired information aware" trimming option work when providing a single (-x) adapter file containing several adapter sequences?
          The answer is YES. However, you need to pay special attention to trimming efficiency. In your case, skewer tries n * n adapter combinations during adapter trimming, where n is the number of adapter sequences provided in the adapter file. If the adapter sequences share most of their content but differ in some region, e.g. a 6-bp indexing region, you may use degenerate characters (such as N) in this region and specify one representative adapter sequence. For instance, if the content of the adapter file is:
          Code:
          >Index 1, ATCACG
          TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
          >Index 2, CGATGT
          TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
          >Index 3, TTAGGC
          TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
          >Index 4, TGACCA
          TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
          >Index 5, ACAGTG
          TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
          >Index 6, GCCAAT
          TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
          you may specify -x TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG to improve trimming efficiency.
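The collapsing step described above can be sketched in a few lines of Python. This is only an illustration and not part of skewer: the helper name degenerate_consensus is hypothetical, and it assumes all adapters have the same length and differ only by substitutions. It writes N at every position where the adapters disagree.

```python
def degenerate_consensus(adapters):
    """Collapse equal-length adapters into one sequence, writing N where they differ."""
    lengths = {len(a) for a in adapters}
    if len(lengths) != 1:
        raise ValueError("adapters must be non-empty and all of the same length")
    # zip(*adapters) walks the adapters column by column.
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in zip(*adapters))


# The six indexed adapters from the adapter file above, built from their shared parts.
PREFIX = "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC"
SUFFIX = "ATCTCGTATGCCGTCTTCTGCTTG"
INDEXES = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA", "ACAGTG", "GCCAAT"]

adapters = [PREFIX + index + SUFFIX for index in INDEXES]
representative = degenerate_consensus(adapters)

print(representative)
print(len(adapters) ** 2, "adapter combinations replaced by a single representative")
```

For these six adapters, the printed representative is the shared prefix, six N characters, and the shared suffix, which is the value passed to -x above.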



          • #20
            Originally posted by roryk View Post
            Thanks replimoc, with 0.1.114 skewer correctly trims off the test sequences I posted. Thanks!
            Hi roryk, thank you for your feedback!



            • #21
              Hi Replimoc,

              thanks for the tip with the barcoded adapters. A very nice feature.

              I got strange results when trimming paired-end data using the parameter "-l 20".
              All the read pairs containing forward reads shorter than 20 bases were indeed filtered out, but not all of the read pairs containing reverse reads shorter than 20 bases.

              Btw, does skewer search for the reverse complements of the adapters by default (likely not in the paired mode)?



              • #22
                Originally posted by luc View Post
                I got strange results when trimming paired-end data using the parameter "-l 20".
                All the read pairs containing forward reads shorter than 20 bases were indeed filtered out, but not all of the read pairs containing reverse reads shorter than 20 bases.
                Could you show us the problematic PE reads in FASTQ format, so that I can figure out what's wrong with the program?

                Originally posted by luc View Post
                Btw, does skewer search for the reverse complements of the adapters by default (likely not in the paired mode)?
                The answer is NO.



                • #23
                  trimmed reads longer than the length!

                  Hi Relipmoc,

                  Thank you for this software. I ran into a problem and may need your help.

                  I am working with HiSeq 2500 data from a Nextera Mate Pair library, and these are the parameters I used:

                  skewer-0.1.114-linux-x86_64 -x GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -y GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -j CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG -m mp -k 9 -f sanger -l 30 -L 150 -o skewer_library1_2 1.fastq 2.fastq

                  -- 3' end adapter sequence (-x): GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
                  -- paired 3' end adapter sequence (-y): GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
                  -- junction adapter sequence (-j): CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
                  -- maximum error ratio allowed (-r): 0.100
                  -- maximum indel error ratio allowed (-d): 0.030
                  -- minimum read length allowed after trimming (-l): 30
                  -- maximum read length for output (-L): 150
                  -- file format (-f): Sanger/Illumina 1.8+ FASTQ
                  -- minimum overlap length for junction adapter detection (-k): 9
                  Wed Jun 4 15:28:27 2014 >> started

                  Thu Jun 5 10:40:33 2014 >> done (69126.658s)
                  208936993 read pairs processed; of these:
                  93035 ( 0.04%) non-junction read pairs filtered out by contaminant control
                  29290940 (14.02%) short read pairs filtered out after trimming by size control
                  6182785 ( 2.96%) empty read pairs filtered out after trimming by size control
                  173370233 (82.98%) read pairs available; of these:
                  94951230 (54.77%) trimmed read pairs available after processing
                  78419003 (45.23%) untrimmed read pairs available after processing

                  The length distribution of reads after trimming reported by skewer shows a maximum read length of 150 bp.

                  However, when I checked the result with FastQC, I found many reads longer than 150 bp (please see the attachment). I also found those "long" reads by eyeballing the result file.

                  Have you ever experienced something like this? What do you think the reason could be?

                  P.S. I have tried this with and without -L 150, and there are longer reads in both cases.

                  Thanks,
                  Attached Files



                  • #24
                    Hi blsfoxfox,

                    Thank you very much for your feedback! The name of the parameter is misleading. Its actual meaning is the maximum equivalent read length, i.e. the integer mean of the two trimmed read lengths in a pair. For example, if the length of trimmed read 1 is 224 and the length of trimmed read 2 is 40, then the equivalent read length is int((224 + 40) / 2) = 132. Therefore, "-L 150" cannot filter out this read pair, but "-L 120" can.

                    For your case, you can try "-L 75". But I guess this is not what you want. We may upgrade skewer to add another parameter for clipping bases after a specified length.
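The arithmetic above can be written out as a small Python sketch. It is only an illustration of the formula given in this post, not skewer's code: both function names are hypothetical, and treating a pair whose equivalent length equals the threshold as kept is an assumption.

```python
def equivalent_read_length(len1, len2):
    """Integer mean of the two trimmed read lengths, i.e. int((len1 + len2) / 2)."""
    return (len1 + len2) // 2


def kept_by_max_length(len1, len2, max_equivalent_length):
    """True if the pair survives a -L style filter.

    Assumption: a pair whose equivalent length equals the threshold is kept.
    """
    return equivalent_read_length(len1, len2) <= max_equivalent_length


# The example from the post: read 1 is 224 bases, read 2 is 40 bases.
print(equivalent_read_length(224, 40))    # 132
print(kept_by_max_length(224, 40, 150))   # True: "-L 150" does not remove the pair
print(kept_by_max_length(224, 40, 120))   # False: "-L 120" removes the pair
```

Because the filter looks only at the mean, one read of a kept pair can be much longer than the -L value as long as its mate is short enough.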



                    • #25
                      Originally posted by relipmoc View Post
                      Hi blsfoxfox,

                      Thank you very much for your feedback! The name of the parameter is misleading. Its actual meaning is the maximum equivalent read length. For example, if the length of trimmed read 1 is 224 and the length of trimmed read 2 is 40, then the equivalent read length is int((224 + 40) / 2) = 132. Therefore, using "-L 150" can not filter out this read pair. But if you use "-L 120", you can filter out this read pair.

                      For your case, you can try "-L 75". But I guess this is not what you want. we may upgrade skewer to add another parameter for clipping bases after a specified length.
                      Thank you for the response! You're right, I would like to clip bases in each read file.

                      Actually, I am more curious about why skewer would produce trimmed reads longer than the original ones. If we understood that, we might avoid getting the long reads and would not need another parameter to deal with them.

                      By the way, skewer is really fast.



                      • #26
                        Originally posted by blsfoxfox View Post
                        Thank you for the response! You're right, I would like to clip bases in each reads file.
                        We will add a parameter for clipping bases in the future versions.

                        Originally posted by blsfoxfox View Post
                        Actually, I am more curious about why would skewer produce trimmed reads longer than original one? Then we may avoid getting the long reads and do not need another parameter to deal with it.
                        Good question! For Nextera long mate-pair (LMP) reads, skewer first treats them as normal paired-end (PE) reads and trims adapters from them. The trimmed reads correspond to fragments that were originally shorter than the read length. If no junction adapter is found within such a pair, it is marked as a non-junction read pair, which should be removed as a contaminant.

                        Otherwise, the non-trimmed reads correspond to fragments that were originally equal to or longer than the read length. These read pairs fall into three classes:
                        1) junction adapters are found in the middle of both reads of the pair;
                        2) a junction adapter is found in the middle of only one read of the pair;
                        3) no junction adapter is found in either read of the pair.
                        For class 1), skewer simply trims the junction adapters as in the single-end (SE) case. For class 2), suppose without loss of generality that read 1 contains the junction adapter and read 2 does not. Skewer searches for the best overlap between the 3' end of read 1 and the 5' end of the reverse complement of read 2. If the overlap lies after the junction adapter region of read 1, then the sub-sequence after the junction adapter region of read 1 is reverse-complemented and appended to read 2. This is why some reads end up longer than the original read length after adapter trimming.
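As a rough illustration of the three classes described above, here is a minimal Python sketch. It is not skewer's implementation: the function classify_pair is hypothetical, and it uses exact substring matching for simplicity, whereas skewer tolerates mismatches and indels (see the -r and -d thresholds) and also accepts partial adapter matches.

```python
# Junction adapter reported by skewer in the runs quoted in this thread (-j).
JUNCTION = "CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG"


def classify_pair(read1, read2, junction=JUNCTION):
    """Return 1, 2 or 3 depending on how many reads contain the junction adapter.

    Assumption: exact substring matching stands in for error-tolerant matching.
    """
    hits = int(junction in read1) + int(junction in read2)
    # Two hits -> class 1, one hit -> class 2, no hit -> class 3.
    return {2: 1, 1: 2, 0: 3}[hits]


with_junction = "ACGTACGT" + JUNCTION + "TTGACCA"
without_junction = "GGCATTAGCCATGA"

print(classify_pair(with_junction, with_junction))        # 1
print(classify_pair(with_junction, without_junction))     # 2
print(classify_pair(without_junction, without_junction))  # 3
```

Only class 2 triggers the overlap search and the transfer of bases to the mate, which is the step that can make a read longer than the original read length.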

                        Originally posted by blsfoxfox View Post
                        By the way, skewer is really fast
                        Thank you for the praise!
                        Last edited by relipmoc; 06-14-2014, 04:00 PM.



                        • #27
                          skewer has been accepted as a methodology paper in BMC Bioinformatics

                          If you find skewer is useful for your study, please kindly cite it in your paper. Thank you!

                          BMC Bioinformatics.2014, 15:182
                          DOI: 10.1186/1471-2105-15-182
                          URL: http://www.biomedcentral.com/1471-2105/15/182
                          Last edited by relipmoc; 06-13-2014, 09:00 AM. Reason: :)



                          • #28
                            The source code is here: https://github.com/relipmoc/skewer
                            I would have thought it would be on SourceForge, but GitHub is way better.
                            Thanks for sharing this.
                            Last edited by ug14cxb; 07-24-2014, 02:13 AM.



                            • #29
                              Originally posted by relipmoc View Post
                              Good question! For Nextera long mate-pair (LMP) reads, skewer first treats them as normal paired-end (PE) reads and trims adapters from them. The trimmed reads correspond to fragments that were originally shorter than the read length. If no junction adapter was found within it, then the trimmed read pair is marked as a non-junction read pair which should be removed as it is contaminant.

                              Otherwise, non-trimmed reads correspond to fragments that are originally equal to or greater than the read length. These read pairs can be classified into three classes. 1) junction adapters are found in the middle of both reads of the pair; 2) junction adapter is found in the middle of one read of the pair; 3) junction adapter is not found in either read of the pair. For class 1), skewer just trims the junction adapters as in single end (SE) cases; for class 2), without loss of generality, suppose read 1 contains junction adapter while read 2 does not contain junction adapter, skewer searches the best overlap between 3' end of read 1 and 5' end of the reverse complement of read 2 , if the overlap is after the junction adapter region of read 1, then the sub-sequences after junction adapter region of read 1 is transferred to its reverse-complemented counterpart and appended to read 2. Then you can find some reads have lengths greater than read length after adapter trimming.
                              Thank you very much for this explanation! It is somewhat strange that SEQanswers is the only place where this is explained. There are still a few questions about how skewer processes Nextera libraries.

                              1. For the 1st case, does "SE trimming" mean removing the junction adapter and the following sequence up to the 5' end as well?
                              2. For the second case (adapter in one read only):
                              (A) What is "the best overlap": by length? by mismatches?
                              (B) What does skewer do if there is no overlap between the reads?
                              3. How can trimming of the external adapters be switched off?
                              4. In the analysis below, it is not clear what "549499 (24.26%) untrimmed read pairs available after processing" means. How can any untrimmed reads be present in the result? Why were they not counted under "5968 ( 0.20%) non-junction read pairs filtered out by contaminant control"?

                              skewer -m mp -t 16 -k 30 -l 40 -b S4-R1.fastq S4-R2.fastq

                              Parameters used:
                              -- 3' end adapter sequence (-x): AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
                              -- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
                              -- junction adapter sequence (-j): CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
                              -- maximum error ratio allowed (-r): 0.100
                              -- maximum indel error ratio allowed (-d): 0.030
                              -- minimum read length allowed after trimming (-l): 40
                              -- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
                              -- minimum overlap length for junction adapter detection (-k): 30
                              -- number of concurrent threads (-t): 16

                              3016744 read pairs processed; of these:
                              5968 ( 0.20%) non-junction read pairs filtered out by contaminant control
                              725620 (24.05%) short read pairs filtered out after trimming by size control
                              20306 ( 0.67%) empty read pairs filtered out after trimming by size control
                              2264850 (75.08%) read pairs available; of these:
                              1715351 (75.74%) trimmed read pairs available after processing
                              549499 (24.26%) untrimmed read pairs available after processing

                              Barcode dispatch after trimming:
                              category count percentage:
                              X01Y01 1422074 82.90%



                              Thank you...



                              • #30
                                Originally posted by MikhailFokin View Post
                                1. For the 1st case does "SE trimming" mean removing junction adapter and following sequence till the 5' end as well?
                                "SE trimming" means removing the junction adapter and the sequence following it at the 3' end.

                                Originally posted by MikhailFokin View Post
                                2. For the second case - adaptor in a one read only
                                (A) what is "the best overlap" - length? mismatches?
                                (B) what Skewer does is there is no overlap between reads?
                                (A) There may be several candidate overlap sites; the best overlap is selected according to the scoring scheme presented in the paper. The threshold for overlap detection is proportional to the -r threshold specified by the user.
                                (B) No additional action is taken in this case.

                                Originally posted by MikhailFokin View Post
                                3. How to switch of trimming of external adaptors?
                                Do you mean trimming the external adapters only? For research purposes, you may use PE mode instead of MP mode, but this is not recommended.

                                Originally posted by MikhailFokin View Post
                                4. In the analysis below it is not clear what is "549499 (24.26%) untrimmed read pairs available after processing", how can any untrimmed reads being present in result? not removed to "5968 ( 0.20%) non-junction read pairs filtered out by contaminant control"

                                This refers to the 3rd case, which is different from the non-junction read pairs. For the 3rd case, we cannot declare that there is no junction adapter in the fragment. For the non-junction read pairs, however, the fragment length is shorter than the read length, so we can confidently declare that they do not contain junction adapters.
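To see how the categories in the summary relate to each other, the counts reported by MikhailFokin's run can be tallied with a short Python sketch. The numbers are copied from the log posted in this thread; the variable names are only illustrative.

```python
# Counts copied from the skewer summary posted in this thread.
total_pairs = 3016744
filtered_or_kept = {
    "non-junction (contaminant control)": 5968,
    "short (size control)": 725620,
    "empty (size control)": 20306,
    "available": 2264850,
}
trimmed_available = 1715351
untrimmed_available = 549499

# The four top-level categories partition all processed read pairs.
assert sum(filtered_or_kept.values()) == total_pairs
# The available pairs split into trimmed and untrimmed pairs.
assert trimmed_available + untrimmed_available == filtered_or_kept["available"]

for name, count in filtered_or_kept.items():
    print(f"{name:>36}: {count:>8} ({100 * count / total_pairs:5.2f}%)")
```

The untrimmed pairs are therefore counted inside the available pairs and are separate from the non-junction pairs removed by contaminant control.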

