Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    It only removes the whole read if the remaining length is less than the minimum length threshold, which, by default is 15:

    -l N Minimum remaining sequence length (15)

    15's pretty low, but you can lower it.

    Comment


    • #17
      But when I run it on the test data. It seems fastq-mcf remove the whole reads. Do I set something wrong?

      fastq-mcf adapters.fa test.fastq -q 0 -l 1 -o test.trim
      Scale used: 2.2
      Threshold used: 1 out of 4
      Adapter date_day (CCTTAAGG): counted 3 at the 'end' of 'test.fastq', clip set to 1
      Files: 1
      Total reads: 4
      Too short after clip: 3

      ----------------adapters.fa----------------
      >date_day
      CCTTAAGG
      >adaptor1
      AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

      ------------------test.fastq------------------------
      @test1
      CCTTAAGGAAAAAAAAAAGGGGGGGGGG
      +test1
      HHHHHHHHHHHHHHHHHHHHHHHHHHHH
      @test2
      CCTTAAGGAAAAAAAAAGGGGGGGGGGG
      +test2
      HHHHHHHHHHHHHHHHHHHHHHHHHHHH
      @test3
      AGAGAGAGAGAGAGAGAGAGAGAGAGAG
      +test3
      HHHHHHHHHHHHHHHHHHHHHHHHHHHH
      @test4
      CCTTAAGGTTGACGTGATCGACACCTGG
      +test4
      [[[[[[[[[[[[[[[[[[[[[[[[[[[[

      Comment


      • #18
        You're right, I'd introduced a bug in detecting starts. I'm sending a release and adding it to my tests. The problem was it should have been saying at 'start' of sequence... not 'end'. By thinking it was an 'end' adapter, it was trimming from the right... hence removing the whole sequence.

        (REVISION: The problem was very short adapter sequences in the input file. fastq-mcf can't be used that way. It uses subsampling to get the adapter thresholds, and requires a minimum of 15 character matches during the subsampling pass)
        Last edited by earonesty; 06-20-2011, 09:40 AM.

        Comment


        • #19
          In my opions, most of adaptor is at the start of reads. Is it right?

          In my RNA-seq data, I just see a pattern at the start, not the end.

          In fact, I cannot understand the means of these options.

          -s N.N Log scale for clip pct to threshold (2.5)
          -t N % occurance threshold before clipping (0.25)
          -m N Minimum clip length, overrides scaled auto (1)
          -p N Maximum adapter difference percentage (20) This is for mismatch?
          -k N sKew percentage causing trimming (2)

          Comment


          • #20
            They do deserve some elaboration in the help:

            # -s N.N Log scale for clip pct to threshold (2.5)

            Scaling factor which causes more frequent occurrences to clip heavily. The negative log base "scale" of the ratio of adapters found becomes the minimum match-length for the adapter.

            # -t N % occurance threshold before clipping (0.25)

            Minimum number of times an adapter needs to be found before clipping is considered necessary


            # -m N Minimum clip length, overrides scaled auto (1)

            Replaces scaling algorithm (above) with hardcoded limit

            # -p N Maximum adapter difference percentage (20) This is for mismatch?

            Yes this is the max # mismatch

            # -k N sKew percentage causing trimming (2)

            If one of the bases is only 2% of the reads, then that cycle is considered "skewed".... if it's at the edge of a read, it's trimmmed off.

            Comment


            • #21
              earonesty,

              I have tried your new release. The output is fellow.

              Threshold used: 1 out of 4. What is the mean here?

              Why both Adapter date_day and Adapter adaptor1 counted 4 at the 'start' of 'test.fastq'?

              I also noticed that the "N" at the begin/end of reads that are not trimed.

              _____________________output___________________________

              fastq-mcf test.fa test.fastq -q 0 -l 1 -o test.trim
              Scale used: 2.2
              Threshold used: 1 out of 4
              Adapter date_day (CCTTAAGG): counted 4 at the 'start' of 'test.fastq', clip set to 1
              Adapter adaptor1 (AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT): counted 4 at the 'start' of 'test.fastq', clip set to 1
              Files: 1
              Total reads: 4
              Too short after clip: 0
              Clipped 'start' reads: Count: 3, Mean: 8.00, Sd: 0.00

              Comment


              • #22
                Well, I've never tried this with short adapters like that. Looks like the thing I did to fix your case made other things worse. I need to put the code back the way it was. The original behavior was more correct. The problem is the use of a very short adapter sequence, which is shorter than the scan length the sytem uses (15 bases). Wtih something that short, it will find them everywhere,and it isn't a realistic test. I panicked and gave you a version that was a lot worse.

                Solution: Use the original version ( i put it back... v 148), and use a sequence longer than 15 BP for your test adapter.

                ( If you really need it to work with a very short string like that... I can get it to work... just the code that does the subsampling needs to change... not the rest. )
                Last edited by earonesty; 06-20-2011, 12:18 PM.

                Comment


                • #23
                  earonesty,

                  Thanks. I will try it.

                  Comment


                  • #24
                    Also, right now the default algorithm will "not clip" if no adapters are found and no skewing is detected in the subsample (unless you pass -f). I'm about to make a change that will also decide clipping is necessary if there are "significant" "low quality region" at either end of the reads. The definition of significance will be based on the -q parameter. If more than 10% of the reads would be trimmed by that parameter, clipping will proceed.

                    Comment


                    • #25
                      Here I also confused with the parameters -f. If no adapters are found and no skewing is detected in the subsample, set -f what will happen? Will it do the trim?

                      Why if more than 10% of the reads would be trimmed by that parameter, clipping will proceed? Does it mean fasta-mcf do the trim only when 10% of total reads need to trim?

                      Do I have some misunderstand?

                      In fact, here just my think.

                      1, When we found adaptor at either end of read (for example, 10% mismatch), we do the trim.

                      2, From the 3' (right part) of read, if the nucleotide's quality is less than the threshold (for example, -q 20), then do the trim.

                      Because the adaptor contamination and low quality nucleotide will let the mapping not correctly.




                      Originally posted by earonesty View Post
                      Also, right now the default algorithm will "not clip" if no adapters are found and no skewing is detected in the subsample (unless you pass -f). I'm about to make a change that will also decide clipping is necessary if there are "significant" "low quality region" at either end of the reads. The definition of significance will be based on the -q parameter. If more than 10% of the reads would be trimmed by that parameter, clipping will proceed.

                      Comment


                      • #26
                        Right.... maybe it should always run ... and -f should be a non-option. I've thought about that. But in my experience, it's better not to clip at all if the percentage clipped is very low. Better to just let those reads get discarded by the aligner... or marked as low-quality mappings and get washed out in the statistics later.

                        Good aligners take into account quality scores when doing alignment, and variant callers do as well. We generally see higher repeatability on unclipped files... but only when the clipping percentage is low. In the 5-10% range. If 95% of the reads would be left alone anyway.. better not to run at all.

                        I'll run some stats, we have about 10,000 samples to look at right now, so i can come up with a decent default threshold. Again... -f will force it to run always, so you can just always run it that way and get what you want.

                        UPDATE: 5% is working well, I'm using it in production for new batches. If you want it to "always" try to clip, regardless of sampling, use -f.
                        ALSO: I made it so short adapters work as "beginnings of sequence" adapters (they always worked for end of seq tests)
                        Last edited by earonesty; 06-22-2011, 07:29 AM.

                        Comment


                        • #27
                          Originally posted by Mark View Post
                          Hi All

                          I recently downloaded the FASTX toolkit and tried to use it for trimming fastq reads of adapter sequences. This did not work, the tool simply discarded any reads containing adapter sequences though this is not seemingly its function according to the documentation. I wrote to the help contact for the tool but recieved no response (see below for details). Has anyone used this tool for this purpose successfully?

                          Thanks for your help

                          Mark

                          #############################################
                          Hello

                          I recently downloaded the FASTX toolkit (fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2) and attempted to use the fastx_clipper tool. I created a test fastq file (3 of the four sequences contain the default adapter CCTTAAGG):

                          @test1
                          CCTTAAGGAAAAAAAAAAGGGGGGGGGG
                          +test1
                          HHHHHHHHHHHHHHHHHHHHHHHHHHHH
                          @test2
                          CCTTAAGGAAAAAAAAAGGGGGGGGGGG
                          +test2
                          HHHHHHHHHHHHHHHHHHHHHHHHHHHH
                          @test3
                          AGAGAGAGAGAGAGAGAGAGAGAGAGAG
                          +test3
                          HHHHHHHHHHHHHHHHHHHHHHHHHHHH
                          @test4
                          CCTTAAGGTTGACGTGATCGACACCTGG
                          +test4
                          [[[[[[[[[[[[[[[[[[[[[[[[[[[[

                          And then executed the command (as shown on FASTX toolkit website)

                          -bash-3.2$ fastx_clipper -v -i test.fastq -a CCTTAAGG
                          @test3
                          AGAGAGAGAGAGAGAGAGAGAGAGAGAG
                          +test3
                          HHHHHHHHHHHHHHHHHHHHHHHHHHHH
                          Clipping Adapter: CCTTAAGG
                          Min. Length: 5
                          Input: 4 reads.
                          Output: 1 reads.
                          discarded 0 too-short reads.
                          discarded 3 adapter-only reads.
                          discarded 0 N reads.

                          As you can see, the three reads that contain the adapter are discarded as “adapter-only reads” which (in my way of looking at things) they are not nor are they too short (default <=5) after any trimming. What is going on here? Does this tool actually trim reads or only discard them if they are found. If the former would you please tell me what I am doing incorrectly? Also if the former, is it possible to supply the tool with multiple adapters to trim?

                          Thanks for your help

                          Mark
                          I don't know if you have already sorted out this problem.But i figured out that fastx_clipper throws away all those reads that starts with adapters (adapter only reads) and there is no way you can tell fastx_clipper not to do that. However if the adapter sequence is found anywhere with in the read then it will clip the read starting with the adapter and keeps the rest of the read. Then you can either ask it to throw the clip read away (with -C option) or keep the clipped read (with -c option).

                          Hope this helps a bit......

                          Upendra

                          Comment


                          • #28
                            Hi earonesty!

                            Thanks a lot for this great tool. I found it just today and the first test with fastq-mcf already left the impression that it is both fast and includes a lot of utilities for clippling.

                            However, I didn't get the meaning of all the options and the output.
                            In particular I would like to know what is meant by "Filtered x reads on purity flag".
                            Here's a sample report of a test case where I lose some 15 % due to this filter (see last line):

                            Code:
                            Scale used: 2.2
                            Filtering Illumina reads on purity field
                            Phred: 33
                            Warning: Too much skewing found (110), disabling skew clipping
                            Threshold used: 251 out of 100000
                            Adapter RNA-seq_PCR-primer_1_reverse (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT): counted 21114 at the 'end' of '../rawdata/ado_pool_PE02_R2.fastq', clip set to 1
                            Adapter RNA-seq_PCR-primer_2_reverse (AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG): counted 21340 at the 'end' of '../rawdata/ado_pool_PE02_R1.fastq', clip set to 1
                            Adapter RNA-seq_PCR-primer_2_reverse (AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG): counted 449 at the 'end' of '../rawdata/ado_pool_PE02_R2.fastq', clip set to 6
                            Files: 2
                            Total reads: 42361724
                            Too short after clip: 420423
                            Clipped 'end' reads (../rawdata/ado_pool_PE02_R1.fastq): Count 20076452, Mean: 20.70, Sd: 24.83
                            Trimmed 16073640 reads (../rawdata/ado_pool_PE02_R1.fastq) by an average of 22.84 bases on quality < 10
                            Clipped 'end' reads (../rawdata/ado_pool_PE02_R2.fastq): Count 18776062, Mean: 22.10, Sd: 25.18
                            Trimmed 15738855 reads (../rawdata/ado_pool_PE02_R2.fastq) by an average of 21.90 bases on quality < 10
                            Filtered 6360682 reads on purity flag
                            I ran fastq-mcf with option -k 100 (this disables skew clipping, does it?).

                            What is purity? Were the reads bad (and in what sense)?
                            Is there a way to switch this off?

                            Comment


                            • #29
                              Purity is illumina's purity filter. you can turn this off with -U ... bu you REALLY SHOULD NOT turn it off. Read up on illumina purity filtering... it is the result of confused signal from adjacent clusters.

                              Comment


                              • #30
                                -k 0 disables skew detection. Normally there's no reason to disable it... it can help find problems in data.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 08:47 AM
                                0 responses
                                12 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                59 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                54 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X