  • fastx_clipper ... not doing what I want )-:

    I'm looking for some help with fastx_clipper. Despite my best (if inexperienced) efforts, it's not doing what I want. So far it's little better than random. Worse, actually, since it's not clipping at the adapter sequence provided, but at other sequences entirely.

    I've analysed my Illumina data using FastQC. This showed contamination with indexed TruSeq adapters. The universal adapter did not show up (below the 0.1% threshold, I guess). It also showed fairly high levels of sequence redundancy in the libraries (min 30%, mostly >60%).

    As part of a pipeline I used fastx_clipper to remove adapter-containing reads entirely (-C option). This removed huge numbers of reads (e.g. ~20% of all reads even for TruSeq Universal adapter - which did not show up in the FastQC analysis).

    I've spent the rest of the afternoon away from the pipeline, using the command line to try to figure out what it's actually doing. For this I used a single fastq file (2.fq) as a test case.


    I used grep -c into the .fastq file (2.fq) to count up the instances of the Illumina universal adapter. This showed there were some, mostly at start of inserts; certainly nothing like the number fastx_clipper removed.

    Code:
    grep -ce AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT 2.fq
    6903
    # total instances
    grep -ce ^AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT 2.fq
    6329
    # almost all of them at start of reads (i.e. no insert)
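One caveat with the grep counts: searching for the full 58-base adapter only finds reads that contain the whole adapter exactly, so an adapter that runs off the end of a read, or carries a sequencing error, is not counted. A minimal sketch that counts reads containing just the first k bases of the adapter, looking at sequence lines only. It assumes standard 4-line FASTQ records; the function name and the choice of k are illustrative and not part of the FASTX-Toolkit.

```shell
# Hypothetical helper: count reads whose sequence line contains the first
# k bases of the adapter. Assumes 4-line FASTQ records (sequence on line 2).
# Usage: count_adapter_prefix_reads ADAPTER K FASTQ
count_adapter_prefix_reads() {
  awk -v a="$1" -v k="$2" '
    BEGIN { p = substr(a, 1, k) }                  # adapter prefix to search for
    NR % 4 == 2 && index($0, p) > 0 { n++ }        # sequence lines only
    END { print n + 0 }
  ' "$3"
}

# Example (not run here): full adapter vs. its first 12 bases
# ADAPTER=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
# count_adapter_prefix_reads "$ADAPTER" 58 2.fq
# count_adapter_prefix_reads "$ADAPTER" 12 2.fq
```

A short prefix will also pick up some chance matches in genuine insert sequence, so the two counts bracket the real number rather than pin it down.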

    I ran a manual fastx_clipper (path specifications removed)
    Code:
    fastx_clipper -Cva AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -i 2.fq -o 2.noUniv.fq
    Min. Length: 5
    Clipped reads - discarded.
    Input: 4959421 reads.
    Output: 3937769 reads.
    discarded 55558 too-short reads.
    discarded 26892 adapter-only reads.
    discarded 936303 clipped reads.      # these two lines total 963195 ...
    discarded 2899 N reads.              # ... >>>> the 6903 from grep
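As a sanity check on the report above, the output plus the four discard categories should add up to the input count. A small sketch in shell arithmetic, with the numbers copied from that run (the variable names are only for illustration):

```shell
# Numbers copied from the fastx_clipper -C report above.
input=4959421
output=3937769
too_short=55558
adapter_only=26892
clipped=936303
n_reads=2899

# Output plus every discard category should reproduce the input count.
total=$((output + too_short + adapter_only + clipped + n_reads))
echo "accounted for: $total of $input"

# Reads discarded because an adapter match was reported.
adapter_discards=$((adapter_only + clipped))
echo "adapter-related discards: $adapter_discards"
```

The categories do account for every input read, so the report is internally consistent; the open question is why so many reads were classed as adapter-containing.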


    Since I couldn't understand what was going on, I repeated this but used option -c to retain only those sequences that had been clipped (i.e. those reads that had originally contained the adapter). Numbers-wise, the outcome was comparable to the first run.
    Code:
    fastx_clipper -cva AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -i 2.fq -o 2_clippedonly
    Min. Length: 5
    Non-Clipped reads - discarded.
    Input: 4959421 reads.
    Output: 936149 reads.         # that's 936149 reads that originally HAD adapter
    discarded 55558 too-short reads.
    discarded 26892 adapter-only reads.
    discarded 3940668 non-clipped reads.
    discarded 154 N reads.
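The -C and -c reports can also be reconciled against each other. A sketch with the numbers copied from both runs (variable names are illustrative): the difference between the clipped reads discarded by -C and the reads kept by -c, and the difference between the non-clipped reads discarded by -c and the reads kept by -C, can each be compared with the N-read counts.

```shell
# -C run (clipped reads discarded)
bigC_output=3937769
bigC_clipped_discarded=936303
bigC_n_discarded=2899

# -c run (non-clipped reads discarded)
smallc_output=936149
smallc_nonclipped_discarded=3940668
smallc_n_discarded=154

# Clipped reads: discarded by -C versus kept by -c.
clipped_diff=$((bigC_clipped_discarded - smallc_output))
# Non-clipped reads: discarded by -c versus kept by -C.
nonclipped_diff=$((smallc_nonclipped_discarded - bigC_output))

echo "clipped diff:     $clipped_diff (N reads in -c run: $smallc_n_discarded)"
echo "non-clipped diff: $nonclipped_diff (N reads in -C run: $bigC_n_discarded)"
```

Both differences equal the N-read counts exactly, which suggests (as an interpretation, not something the reports state) that the two runs classify the same reads as clipped and differ only in which set is written out.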

    I extracted at random a single sequence (TGGTATTTTATTTTTCTACCTAAATTT) from this file, and grepped back into the file to recover all reads terminating with the sequence.
    Code:
    grep TGGTATTTTATTTTTCTACCTAAATTT 2_clippedonly | sort | uniq -c | sort -k1,1n > 2_clipped_TGGTATTTTATTTTTCTACCTAAATTT

    Then I did the same with the original file (2.fq) so I had a set of clipped and original unclipped sequences I could compare.
    Code:
    grep TGGTATTTTATTTTTCTACCTAAATTT 2.fq | sort | uniq -c | sort -k1,1n > 2_original_TGGTATTTTATTTTTCTACCTAAATTT

    Comparing the two on alternating lines below (first line clipped, second line original, etc.) shows that the sequence removed by fastx_clipper is not the one supplied as the -a parameter. If I BLAST the removed sequence, it is contiguous with the original random sequence (TGGTATTTTATTTTTCTACCTAAATTT), not the adapter string supplied on the command line.


    CTTGGTATTTTATTTTTCTACCTAAATTT
    CTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAATTTA
    TCTTGGTATTTTATTTTTCTACCTAAATTT
    TCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAATTT
    TTTCTTGGTATTTTATTTTTCTACCTAAATTT
    TTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAAT
    TTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
    TTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGC
    ATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
    ATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATG
    AAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
    AAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATA
    GGAAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
    GGAAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAA
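One way to check whether the removed sequence resembles the adapter at all is to recover what was clipped from a read and compare it base by base with the start of the adapter. A minimal sketch using the first clipped/original pair above. The helper name is hypothetical, and a plain position-by-position comparison is an assumption for illustration, not necessarily how fastx_clipper aligns the adapter.

```shell
ADAPTER=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
CLIPPED=CTTGGTATTTTATTTTTCTACCTAAATTT
ORIGINAL=CTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAATTTA

# Removed part = original read with the clipped read stripped from its start.
REMOVED=${ORIGINAL#"$CLIPPED"}

# Hypothetical helper: count mismatches over the first k bases of two sequences.
# Usage: count_mismatches SEQ_A SEQ_B K
count_mismatches() {
  awk -v a="$1" -v b="$2" -v k="$3" 'BEGIN {
    for (i = 1; i <= k; i++)
      if (substr(a, i, 1) != substr(b, i, 1)) n++
    print n + 0
  }'
}

echo "removed starts with: $(printf '%.10s' "$REMOVED")"
echo "adapter starts with: $(printf '%.10s' "$ADAPTER")"
echo "mismatches in first 6 bases:  $(count_mismatches "$REMOVED" "$ADAPTER" 6)"
echo "mismatches in first 10 bases: $(count_mismatches "$REMOVED" "$ADAPTER" 10)"
```

For this pair the removed sequence begins AAATCG, one base away from the adapter's AGATCG, which would be consistent with a clipper that accepts a short, imperfect adapter match and then removes everything downstream of it.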

    I've since repeated this with 4 additional sequences from the fastx_clipper output, with the same result. This has left me baffled. I would welcome someone pointing out my simple error (?)!

    M
    Last edited by mgg; 12-07-2011, 01:27 AM. Reason: more informative title, early precis of problem

  • #2
    I used fastx_clipper like this:

    Code:
    zcat sequences.gz | fastx_clipper -v -l 20 -M 15 -a GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG | other_steps >filterered.fastq
    I am also wondering whether this is actually doing what I asked of it. When I had the program give me the 'adaptor-only' sequences, they had the adaptor at the beginning, followed by other bases. So, not adaptor-only.

    Your post makes me wonder if the program is very buggy, and perhaps should not be used. Have you contacted the author(s)?



    • #3
      Hi,

      Did you try Cutadapt?


      Emilie



      • #4
        Hi,

        I've also had problems with fastx_clipper. It seems that not only is it not doing the trimming I expect, but something is also a bit odd about what it outputs.

        I tested this by passing it a single read, which was filtered out, giving an empty output. If I then pass the same read along with two others, that read is retained, and one of the others that should be returned is not.

        I'm currently in the process of switching my pipeline over to cutadapt. Will report back on my results.



        • #5
          I have had some similar issues with figuring out my fastx output. Has anyone spoken with the authors of the program? I haven't been able to glean much from their website or manual about the details underlying this program.



          • #6
            I've been using cutadapt with more success.



            • #7
              I had problems too with fastx_clipper's lack of specificity, and I wrote to the authors.

              They wrote back and told me that it was indeed a limitation of the program. The fact is that fastx_clipper was designed for small RNA experiments and so it's tweaked to be very sensitive and not specific at all, clipping anything that resembles an adapter and any nucleotides after that.

              As a consequence it's not the best tool to clip adapters for general experiments and it's only suitable for the small RNA ones.



              • #8
                Use Trimmomatic....better software that is mate-pair aware



                • #9
                  We (the Enright lab) have developed reaper and tally. The first is for demultiplexing, stripping, trimming adapter, and filtering of various sorts, the second for deduplicating sequence data. They can work in conjunction or apart, and allow handling of paired-end files. Adapter stripping is handled by Smith-Waterman local alignments, and highly customisable. Manual: http://www.ebi.ac.uk/~stijn/reaper/reaper.html, download: http://www.ebi.ac.uk/~stijn/reaper/s...per-12-205.tgz. These are parts of a larger exciting pipeline, to be published imminently, for comprehensive analysis of small-RNA experiments or clean-up and QC of paired-end sequencing data in general.



                  • #10
                    Unfortunate overlap - namewise - with REAPR, http://www.sanger.ac.uk/resources/software/reapr/ ...



                    • #11
                      Parameter-sweep - comparison of clipping programs:
